Mastering NLP from Foundations to LLMs
Apply advanced rule-based techniques to LLMs and solve real-world business problems using Python
Lior Gazit
Meysam Ghaffari
Copyright © 2024 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing or its dealers and distributors, will be held liable for any damages caused or alleged to have been caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
Group Product Manager: Ali Abidi
Publishing Product Manager: Ali Abidi
Book Project Manager: Hemangi Lotlikar
Content Development Editor: Priyanka Soam
Technical Editor: Rahul Limbachiya
Copy Editor: Safis Editing
Proofreader: Safis Editing
Indexer: Rekha Nair
Production Designer: Gokul Raj S.T
Senior DevRel Marketing Coordinator: Vinishka Kalra
First published: April 2024
Production reference: 1040424
Published by
Packt Publishing Ltd.
Grosvenor House
11 St Paul’s Square
Birmingham
B3 1RB, UK.
ISBN: 978-1-80461-918-6
Natural language processing (NLP) lies at the heart of a perplexing question – how can two radically different entities – humans and computers – truly communicate with one another? Human language is the complex, imperfect product of social and biological evolution. It’s filled with illogical exceptions, subtle nuance, and multiple levels of abstract thinking. In contrast, computers communicate via mathematical models that, however complex, follow a logical, verifiable set of rules. As digital systems assume an ever greater role in human activity, they must be able to correctly interpret what humans actually mean from the words they say.
Lior Gazit and Meysam Ghaffari’s new book, Mastering NLP from Foundations to LLMs, is a monumental resource for making that happen. Written for technology professionals who work with text – from beginners to seasoned NLP pros – the book lays out a practical strategy for one of this century’s most daunting challenges. It charts a meticulous course through the intricate realms of NLP and large language models (LLMs), guiding you through fundamental concepts to the apex of contemporary artificial intelligence.
The book draws from Gazit’s immersion in the fast-paced world of finance and Ghaffari’s innovative NLP development in the healthcare sector. The result is a rare balance between technical depth and practical relevance. Their combined expertise shapes a book as rich in information as it is robust in practical insights. Each author’s distinct influence enriches the narrative – combining Gazit’s understanding of the power of machine learning (ML) to drive growth and Ghaffari’s humanistic approach to applying ML for societal good.
Gazit and Ghaffari lay a solid foundation in the essential mathematical and statistical pillars supporting the complex algorithms of NLP. They employ a pedagogical strategy that progressively builds from basic principles to advanced applications, ensuring a clear trajectory of learning and comprehension.
As the narrative progresses, it delves deep into the engineering of ML models. You are guided through model construction, application, and the nuanced balance of fit versus generalizability. The exploration of text preprocessing is thorough, equipping you with essential tools to effectively prepare data for NLP tasks, from tokenization to the subtle art of named entity recognition.
The book’s centerpiece – LLMs – is unveiled with a blend of care and deep expertise. The authors articulate the theoretical foundations, developmental challenges, and breakthroughs that mark the ascent of LLMs, guiding you to contemplate the direction of these formidable technologies. The authors also provide practical advice on setting up and accessing LLMs, providing you with actionable paths to harness these models in your work. This demystifies the sometimes daunting task of integrating advanced models into practical use cases.
In its visionary chapters, the book plunges into the functionalities of advanced technologies such as RAG and LangChain, providing a glimpse into an automated future where AI manages increasingly complex tasks. This narrative not only educates but inspires, mapping out the potential of LLMs to enhance performance and facilitate even greater innovation.
Concluding with a series of expert interviews, Mastering NLP from Foundations to LLMs offers a diverse array of perspectives that enrich the narrative with real-world applications across various industries. This remarkable book showcases the widespread transformation being driven by NLP and LLMs – giving technology professionals a detailed roadmap to a future in which they themselves can have a major role. It is a must-read for anyone who wants a seat at the LLM table.
Asha Saxena,
Entrepreneur, Professor, and AI Strategist
Bio: https://ashasaxena.com/about-bio/
Lior Gazit is a highly skilled ML professional with a proven track record of success in building and leading teams that use ML to drive business growth. He is an expert in NLP and has successfully developed innovative ML pipelines and products. He holds a master’s degree and has published in peer-reviewed journals and conferences. As a senior director of an ML group in the financial sector and a principal ML advisor at an emerging start-up, Lior is a respected leader in the industry, with a wealth of knowledge and experience to share. With much passion and inspiration, Lior is dedicated to using ML to drive positive change and growth in his organizations.
Meysam Ghaffari is a senior data scientist with a strong background in NLP and deep learning. He currently works at MSKCC, where he specializes in developing and improving ML and NLP models for healthcare problems. He has over nine years of experience in ML and over four years of experience in NLP and deep learning. He received his PhD in computer science from Florida State University, his MS in computer science – artificial intelligence from Isfahan University of Technology, and his BS in computer science from Iran University of Science and Technology. He also worked as a post-doctoral research associate at the University of Wisconsin-Madison before joining MSKCC.
Amreth Chandrasehar is an engineering leader in cloud, AI/ML engineering, observability, automation, and SRE. Over the last few years, Amreth has played a key role in cloud migration, generative AI, AIOps, observability, and ML adoption at various organizations. Amreth is also a co-creator of the Conduktor Platform and a tech/customer advisory board member on observability at various companies. Amreth has also co-created and open-sourced Kardio.io, a service health dashboard tool. Amreth has been invited to speak at several key conferences and has received several awards. I would like to thank my wife, Ashwinya, and my son, Athvik, for their patience and support during my review of this book.
Shivani Modi is a data scientist with a rich background in machine learning, deep learning, and NLP, having earned her Master’s from Columbia University. Her career spans roles at IBM, SAP, and C3 AI, as well as a leadership role at Konko AI, focusing on scalable AI models and innovative LLM tools. Shivani’s dedication to mentoring and ethical AI application is evident through her advisory roles and commitment to the societal benefits of technology. Her upcoming projects aim to enhance LLM utilization for developers, prioritizing security and efficiency.
This book has been created by authors, technical experts, and a professional publishing team. The views expressed in the book are of the authors and not their employers.
This book provides an in-depth introduction to natural language processing (NLP) techniques, starting with the mathematical foundations of machine learning (ML) and working up to advanced NLP applications such as large language models (LLMs) and AI applications. As part of your learning experience, you’ll get to grips with linear algebra, optimization, probability, and statistics, which are essential for understanding and implementing ML and NLP algorithms. You’ll also explore general ML techniques and find out how they relate to NLP. The preprocessing of text data, including methods for cleaning and preparing text for analysis, will follow, right before you learn how to perform text classification, which is the task of assigning a label or category to a piece of text based on its content. The advanced topics of LLMs’ theory, design, and applications will be discussed toward the end of the book, as will the future trends in NLP, which will feature expert opinions on the future of the field. To strengthen your practical skills, you’ll also work on mocked real-world NLP business problems and solutions.
This book is for technical folks, ranging from deep learning and ML researchers, hands-on NLP practitioners, and ML/NLP educators, to STEM students. Professionals working with text as part of their projects and existing NLP practitioners will also find plenty of useful information in this book. Beginner-level ML knowledge and a basic working knowledge of Python will help you get the best out of this book.
Chapter 1, Navigating the NLP Landscape: A Comprehensive Introduction, explains what the book is about, which topics we will cover, and who can use this book. This chapter will help you decide whether this book is the right fit for you or not.
Chapter 2, Mastering Linear Algebra, Probability, and Statistics for Machine Learning and NLP, has three parts. In the first part, we will review the basics of linear algebra that are needed at different parts of the book. In the next part, we will review the basics of statistics, and finally, we will present basic statistical estimators.
Chapter 3, Unleashing Machine Learning Potentials in NLP, discusses different concepts and methods in ML that can be used to tackle NLP problems. We will discuss general feature selection and classification techniques. We will cover general aspects of ML problems, such as train/test/validation selection, and dealing with imbalanced datasets. We will also discuss performance metrics for evaluating ML models that are used in NLP problems. We will explain the theory behind the methods as well as how to use them in code.
Chapter 4, Streamlining Text Preprocessing Techniques for Optimal NLP Performance, talks about various text preprocessing steps in the context of real-world problems. We will explain which steps suit which needs, based on the scenario that is to be solved. There will be a complete Python pipeline presented and reviewed in this chapter.
Chapter 5, Empowering Text Classification: Leveraging Traditional Machine Learning Techniques, explains how to perform text classification. Theory and implementation will also be explained. A comprehensive Python notebook will be covered as a case study.
Chapter 6, Text Classification Reimagined: Delving Deep into Deep Learning Language Models, covers the problems that can be solved using deep learning neural networks. The different problems in this category will be introduced to you so you can learn how to efficiently solve them. The theory of the methods will be explained here and a comprehensive Python notebook will be covered as a case study.
Chapter 7, Demystifying Large Language Models: Theory, Design, and Langchain Implementation, outlines the motivations behind the development and usage of LLMs, alongside the challenges faced during their creation. Through an examination of state-of-the-art model designs, you will gain comprehensive insights into the theoretical underpinnings and practical applications of LLMs.
Chapter 8, Accessing the Power of Large Language Models: Advanced Setup and Integration with RAG, guides you through setting up LLM applications, both API-based and open source, and delves into prompt engineering and RAG via LangChain. We will review practical applications in code.
Chapter 9, Exploring the Frontiers: Advanced Applications and Innovations Driven by LLMs, dives into enhancing LLM performance using RAG, exploring advanced methodologies, automatic web source retrieval, prompt compression, API-cost reduction, and collaborative multi-agent LLM teams, pushing the boundaries of current LLM applications. Here, you will review multiple Python notebooks, each handling different advanced solutions to practical use cases.
Chapter 10, Riding the Wave: Analyzing Past, Present, and Future Trends Shaped by LLMs and AI, dives into the transformative impact of LLMs and AI on technology, culture, and society, exploring key trends, computational advancements, the significance of large datasets, and the evolution, purpose, and social implications of LLMs in business and beyond.
Chapter 11, Exclusive Industry Insights: Perspectives and Predictions from World Class Experts, offers a deep dive into future NLP and LLM trends through conversations with experts in legal, research, and executive roles, exploring challenges, opportunities, and the intersection of LLMs with professional practices and ethical considerations.
All the code presented in this book is in the form of Jupyter notebooks. All the code was developed with Python 3.10.X and is expected to work with later versions as well.
Software/hardware covered in the book | Operating system requirements
Access to a Python environment via one of the following: | Windows, macOS, or Linux
Sufficient computation resources, as follows:
As the code examples in this book cover a diversified set of use cases, for some of the advanced LLM solutions, you will need an OpenAI account, which will provide an API key.
If you are using the digital version of this book, we advise you to type the code yourself or access the code from the book’s GitHub repository (a link is available in the next section). Doing so will help you avoid any potential errors related to the copying and pasting of code.
You can download the example code files for this book from GitHub at https://github.com/PacktPublishing/Mastering-NLP-from-Foundations-to-LLMs. If there’s an update to the code, it will be updated in the GitHub repository.
Throughout the book, we review complete code notebooks that represent professional, industry-level solutions:
Chapter | Notebook Name
4 | Ch4_Preprocessing_Pipeline.ipynb, Ch4_NER_and_POS.ipynb
5 | Ch5_Text_Classification_Traditional_ML.ipynb
6 | Ch6_Text_Classification_DL.ipynb
8 | Ch8_Setting_Up_Close_Source_and_Open_Source_LLMs.ipynb, Ch8_Setting_Up_LangChain_Configurations_and_Pipeline.ipynb
9 | Ch9_Advanced_LangChain_Configurations_and_Pipeline.ipynb, Ch9_Advanced_Methods_with_Chains.ipynb, Ch9_Completing_a_Complex_Analysis_with_a_Team_of_LLM_Agents.ipynb, Ch9_RAGLlamaIndex_Prompt_Compression.ipynb, Ch9_Retrieve_Content_from_a_YouTube_Video_and_Summarize.ipynb
We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!
There are a number of text conventions used throughout this book.
Code in text: Indicates code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles. Here is an example: “Now, we add a feature for achieving the syntax. We define the output_parser variable, and we use a different function for generating the output, predict_and_parse().”
A block of code is set as follows:
import pandas as pd
import matplotlib.pyplot as plt
# Load the record dict from URL
import requests
import pickle
When we wish to draw your attention to a particular part of a code block, the relevant lines or items are set in bold:
qa_engineer (to manager_0): exitcode: 0 (execution succeeded) Code output: Figure(640x480) programmer (to manager_0): TERMINATE
Bold: Indicates a new term, an important word, or words that you see onscreen. For instance, words in menus or dialog boxes appear in bold. Here is an example: “While we chose one particular database, you can refer to the Vector Store page to read more about the different choices.”
Tips or important notes
Appear like this.
Feedback from our readers is always welcome.
General feedback: If you have questions about any aspect of this book, email us at customercare@packtpub.com and mention the book title in the subject of your message.
Errata: Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you have found a mistake in this book, we would be grateful if you would report this to us. Please visit www.packtpub.com/support/errata and fill in the form.
Piracy: If you come across any illegal copies of our works in any form on the internet, we would be grateful if you would provide us with the location address or website name. Please contact us at copyright@packt.com with a link to the material.
If you are interested in becoming an author: If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, please visit authors.packtpub.com.
Please leave a review. Once you have read and used this book, why not leave a review on the site that you purchased it from? Potential readers can then see and use your unbiased opinion to make purchase decisions, we at Packt can understand what you think about our products, and our authors can see your feedback on their book. Thank you!
For more information about Packt, please visit www.packtpub.com.
Once you’ve read Mastering NLP from Foundations to LLMs, we’d love to hear your thoughts! Please click here to go straight to the Amazon review page for this book and share your feedback.
Your review is important to us and the tech community and will help us make sure we’re delivering excellent quality content.
Thanks for purchasing this book!
Do you like to read on the go but are unable to carry your print books everywhere?
Is your eBook purchase not compatible with the device of your choice?
Don’t worry, now with every Packt book you get a DRM-free PDF version of that book at no cost.
Read anywhere, any place, on any device. Search, copy, and paste code from your favorite technical books directly into your application.
The perks don’t stop there, you can get exclusive access to discounts, newsletters, and great free content in your inbox daily
Follow these simple steps to get the benefits:
https://packt.link/free-ebook/978-1-80461-918-6
This book is aimed at helping professionals apply natural language processing (NLP) techniques to their work, whether they are working on NLP projects or using NLP in other areas, such as data science. The purpose of the book is to introduce you to the field of NLP and its underlying techniques, including machine learning (ML) and deep learning (DL). Throughout the book, we highlight the importance of mathematical foundations, such as linear algebra, statistics and probability, and optimization theory, which are necessary to understand the algorithms used in NLP. The content is accompanied by code examples in Python that allow you to practice, experiment, and reproduce some of the developments presented in the book.
The book discusses the challenges faced in NLP, such as understanding the context and meaning of words, the relationships between them, and the need for labeled data. The book also mentions the recent advancements in NLP, including pre-trained language models, such as BERT and GPT, and the availability of large amounts of text data, which has led to improved performance on NLP tasks.
The book will engage you by discussing the impact of language models on the field of NLP, including improved accuracy and effectiveness in NLP tasks, the development of more advanced NLP systems, and accessibility to a broader range of people.
We will be covering the following headings in the chapter:
The target audience of the book is professionals who work with text as part of their projects. This may include NLP practitioners, who may be beginners, as well as those who do not typically work with text.
NLP is a field of artificial intelligence (AI) focused on the interaction between computers and human languages. It involves using computational techniques to understand, interpret, and generate human language, making it possible for computers to understand and respond to human input naturally and meaningfully.
The history of NLP is a fascinating journey through time, tracing back to the 1950s, with significant contributions from pioneers such as Alan Turing. Turing’s seminal paper, Computing Machinery and Intelligence, introduced the Turing test, laying the groundwork for future explorations in AI and NLP. This period marked the inception of symbolic NLP, characterized by the use of rule-based systems, such as the notable Georgetown experiment in 1954, which ambitiously aimed to solve machine translation by generating a translation of Russian content into English (see https://en.wikipedia.org/wiki/Georgetown%E2%80%93IBM_experiment). Despite early optimism, progress was slow, revealing the complexities of language understanding and generation.
The 1960s and 1970s saw the development of early NLP systems, which demonstrated the potential for machines to engage in human-like interactions using limited vocabularies and knowledge bases. This era also witnessed the creation of conceptual ontologies, crucial for structuring real-world information in a computer-understandable format. However, the limitations of rule-based methods led to a paradigm shift in the late 1980s towards statistical NLP, fueled by advances in ML and increased computational power. This shift enabled more effective learning from large corpora, significantly advancing machine translation and other NLP tasks. This paradigm shift not only represented a technological and methodological advancement but also underscored a conceptual evolution in the approach to linguistics within NLP. In moving away from the rigidity of predefined grammar rules, this transition embraced corpus linguistics, a method that allows machines to “perceive” and understand languages through extensive exposure to large bodies of text. This approach reflects a more empirical and data-driven understanding of language, where patterns and meanings are derived from actual language use rather than theoretical constructs, enabling more nuanced and flexible language processing capabilities.
Entering the 21st century, the emergence of the web provided vast amounts of data, catalyzing research in unsupervised and semi-supervised learning algorithms. The breakthrough came with the advent of neural NLP in the 2010s, where DL techniques began to dominate, offering unprecedented accuracy in language modeling and parsing. This era has been marked by the development of sophisticated models such as Word2Vec and the proliferation of deep neural networks, driving NLP towards more natural and effective human-computer interaction. As we continue to build on these advancements, NLP stands at the forefront of AI research, with its history reflecting a relentless pursuit of understanding and replicating the nuances of human language.
In recent years, NLP has also been applied to a wide range of industries, such as healthcare, finance, and social media, where it has been used to automate decision-making and enhance communication between humans and machines. For example, NLP has been used to extract information from medical documents, analyze customer feedback, translate documents between languages, and search through enormous amounts of posts.
Traditional methods in NLP begin with text preprocessing (synonymous with text preparation), followed by the application of ML methods. Preprocessing text is an essential step in NLP and ML applications. It involves cleaning and transforming the raw text data into a form that can be easily understood and analyzed by ML algorithms. The goal of preprocessing is to remove noise and inconsistencies and standardize the data, making it more suitable for advanced NLP and ML methods.
One of the key benefits of preprocessing is that it can significantly improve the performance of ML algorithms. For example, removing stop words, which are common words that do not carry much meaning, such as “the” and “is,” can help reduce the dimensionality of the data, making it easier for the algorithm to identify patterns.
Take the following sentence as an example:
I am going to the store to buy some milk and bread.
After removing the stop words, we have the following:
going store buy milk bread.
In the example sentence, the stop words “I,” “am,” “to,” “the,” “some,” and “and” do not add any additional meaning to the sentence and can be removed without changing the overall meaning of the sentence. It should be emphasized that the removal of stop words needs to be tailored to the specific objective, as the omission of a particular word might be trivial in one context but detrimental in another.
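To make this concrete, here is a minimal stop-word-removal sketch in plain Python. The stop-word set is hand-picked for this example sentence; real pipelines typically draw on curated lists such as those shipped with NLTK or spaCy.

```python
# Toy stop-word removal; the set below is hand-picked for this example,
# not taken from any particular library.
STOP_WORDS = {"i", "am", "to", "the", "some", "and"}

def remove_stop_words(sentence: str) -> str:
    """Drop tokens that appear in STOP_WORDS (case-insensitive)."""
    kept = [tok for tok in sentence.split() if tok.lower() not in STOP_WORDS]
    return " ".join(kept)

print(remove_stop_words("I am going to the store to buy some milk and bread"))
# going store buy milk bread
```

Note that this simple tokenization splits on whitespace only; punctuation handling would be part of a fuller preprocessing pipeline.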
Additionally, stemming and lemmatization, which reduce words to their base forms, can help reduce the number of unique words in the data, making it easier for the algorithm to identify relationships between them; both techniques are explained fully later in this book.
Take the following sentence as an example:
The boys ran, jumped, and swam quickly.
After applying stemming, which reduces each word to its root or stem form, disregarding word tense or derivational affixes, we might get:
The boy ran, jump, and swam quick.
Stemming simplifies the text to its base forms. In this example, “ran,” “jumped,” and “swam” are reduced to “ran,” “jump,” and “swam,” respectively. Note that “ran” and “swam” do not change, as stemming often results in words that are close to their root form but not exactly the dictionary base form. This process helps reduce the complexity of the text data, making it easier for machine learning algorithms to match and analyze patterns without getting bogged down by variations of the same word.
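The behavior described above can be sketched with a deliberately naive suffix-stripping stemmer. This toy function illustrates the idea only; it is not the Porter algorithm that libraries such as NLTK implement, and it knows nothing about irregular forms, which is why "ran" and "swam" pass through unchanged.

```python
# Naive suffix stripper: removes a few common suffixes when the remaining
# stem would still be reasonably long. Irregular forms are untouched.
SUFFIXES = ("ing", "ed", "ly", "s")

def naive_stem(word: str) -> str:
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

words = ["boys", "ran", "jumped", "swam", "quickly"]
print([naive_stem(w) for w in words])
# ['boy', 'ran', 'jump', 'swam', 'quick']
```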
Take the following sentence as an example:
The boys ran, jumped, and swam quickly.
After applying lemmatization, which considers the morphological analysis of the words, aiming to return the base or dictionary form of a word, known as the lemma, we get:
The boy run, jump, and swim quickly.
Lemmatization accurately converts “ran,” “jumped,” and “swam” to “run,” “jump,” and “swim.” This process takes into account the part of speech of each word, ensuring that the reduction to the base form is both grammatically and contextually appropriate. Unlike stemming, lemmatization provides a more precise reduction to the base form, ensuring that the processed text remains meaningful and contextually accurate. This enhances the performance of NLP models by enabling them to understand and process language more effectively, reducing the dataset’s complexity while maintaining the integrity of the original text.
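Proper lemmatization requires a vocabulary and morphological analysis, as provided by tools such as NLTK's WordNetLemmatizer or spaCy. The tiny lookup table below is a hypothetical stand-in that only shows the input/output contract of a lemmatizer.

```python
# Toy lemma lookup; a real lemmatizer consults a dictionary plus the
# part of speech of each token. Words not in the table pass through.
LEMMA_TABLE = {"ran": "run", "jumped": "jump", "swam": "swim", "boys": "boy"}

def toy_lemmatize(word: str) -> str:
    return LEMMA_TABLE.get(word.lower(), word.lower())

sentence = "The boys ran jumped and swam quickly"
print(" ".join(toy_lemmatize(w) for w in sentence.split()))
# the boy run jump and swim quickly
```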
Two other important aspects of preprocessing are data normalization and data cleaning. Data normalization includes converting all text to lowercase, removing punctuation, and standardizing the format of the data. This helps to ensure that the algorithm does not treat different variations of the same word as separate entities, which can lead to inaccurate results.
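A minimal sketch of the normalization steps just described, using only the Python standard library:

```python
import re
import string

# Lowercase the text and strip punctuation, two of the normalization
# steps described above.
def normalize(text: str) -> str:
    text = text.lower()
    # Remove every character listed in string.punctuation.
    return re.sub(f"[{re.escape(string.punctuation)}]", "", text)

print(normalize("Hello, World! It's NLP."))
# hello world its nlp
```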
Data cleaning includes removing duplicate or irrelevant data and correcting errors or inconsistencies in the data. This is particularly important in large datasets, where manual cleaning is time-consuming and error-prone. Automated preprocessing tools can help to quickly identify and remove errors, making the data more reliable for analysis.
Figure 1.1 portrays a comprehensive preprocessing pipeline. We will cover this code example in Chapter 4:
Figure 1.1 – Comprehensive preprocessing pipeline
In conclusion, preprocessing text is a vital step in NLP and ML applications; it improves the performance of ML algorithms by removing noise and inconsistencies and standardizing the data. Additionally, it plays a crucial role in data preparation for NLP tasks and in data cleaning. By investing time and resources in preprocessing, one can ensure that the data is of high quality and is ready for advanced NLP and ML methods, resulting in more accurate and reliable results.
As our text data is prepared for further processing, the next step typically involves fitting an ML model to it.
ML is a subfield of AI that involves training algorithms to learn from data, allowing them to make predictions or decisions without being explicitly programmed to do so. ML is driving advancements in many different fields, such as computer vision, voice recognition, and, of course, NLP.
Diving a little more into the specific techniques of ML, a particular technique used in NLP is statistical language modeling, which involves training algorithms on large text corpora to predict the likelihood of a given sequence of words. This is used in a wide range of applications, such as speech recognition, machine translation, and text generation.
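As a sketch of the idea, here is a minimal count-based bigram language model trained on a toy corpus; real systems use far larger corpora plus smoothing, but the principle of estimating the likelihood of the next word from counts is the same.

```python
from collections import Counter, defaultdict

# Estimate P(next | current) from bigram counts over a toy corpus.
corpus = "the cat sat on the mat . the cat ate the fish .".split()

bigram_counts = defaultdict(Counter)
for current, nxt in zip(corpus, corpus[1:]):
    bigram_counts[current][nxt] += 1

def prob(current: str, nxt: str) -> float:
    """Maximum-likelihood estimate of P(nxt | current); 0.0 if unseen."""
    total = sum(bigram_counts[current].values())
    return bigram_counts[current][nxt] / total if total else 0.0

# "the" occurs 4 times; 2 of those occurrences are followed by "cat".
print(prob("the", "cat"))
# 0.5
```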
Another essential technique is DL, a subfield of ML that involves training artificial neural networks on large amounts of data. DL models, such as convolutional neural networks (CNNs) and recurrent neural networks (RNNs), have proven effective for NLP tasks such as language understanding, text summarization, and sentiment analysis.
Figure 1.2 portrays the relationship between AI, ML, DL, and NLP:
Figure 1.2 – The relationship between the different disciplines
The solid base for NLP and ML is the mathematical foundations from which the algorithms stem. In particular, the key foundations are linear algebra, statistics and probability, and optimization theory. Chapter 2 will survey the key concepts you will need in these areas. Throughout the book, we will present proofs and justifications for the various methods and hypotheses.
One of the challenges in NLP is dealing with the vast amount of data that is generated in human language. This includes understanding the context, as well as the meaning of the words and relationships between them. To deal with this challenge, researchers have developed various techniques, such as embeddings and attention mechanisms, which represent the meaning of words in a numerical format and help identify the most critical parts of the text, respectively.
Another challenge in NLP is the need for labeled data, as manually annotating large text corpora is expensive and time-consuming. To address this problem, researchers have developed unsupervised and weakly supervised methods that can learn from unlabeled data, such as clustering, topic modeling, and self-supervised learning.
Overall, NLP is a rapidly evolving field that has the potential to transform the way we interact with computers and information. It is used in various applications, from chatbots and language translation to text summarization and sentiment analysis. The use of ML techniques, such as statistical language modeling and DL, has been crucial in developing these systems. Ongoing research addresses the remaining challenges, such as understanding context and dealing with the lack of labeled data.
One of the most significant advances in NLP has been the development of pre-trained language models, such as Bidirectional Encoder Representations from Transformers (BERT) and Generative Pre-trained Transformers (GPT). These models have been trained on massive amounts of text data and can be fine-tuned for specific tasks, such as sentiment analysis or language translation.
Transformers, the technology behind the BERT and GPT models, revolutionized NLP by enabling machines to understand the context of words in sentences more effectively. Unlike previous methods that processed text linearly, transformers can handle words in parallel, capturing nuances in language through attention mechanisms. This allows them to discern the importance of each word relative to others, greatly enhancing the model’s ability to grasp complex language patterns and nuances and setting a new standard for accuracy and fluency in NLP applications. This has enhanced the creation of NLP applications and has led to improved performance on a wide range of NLP tasks.
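To make the attention mechanism concrete, here is a sketch of scaled dot-product attention, the core operation inside a transformer, written with NumPy. The shapes and random inputs are illustrative assumptions rather than any model's actual code.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # similarity of each query to each key
    # Numerically stable softmax over the key dimension.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V               # weighted mix of the values

rng = np.random.default_rng(0)
Q = rng.standard_normal((3, 4))      # 3 tokens, embedding dimension 4
K = rng.standard_normal((3, 4))
V = rng.standard_normal((3, 4))
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)
# (3, 4)
```

Each output row is a context-dependent blend of all value vectors, which is how every token can "attend" to every other token in parallel.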
Figure 1.3 details the functional design of the Transformer component.
Figure 1.3 – Transformer in model architecture
Another important development in NLP has been the increase in the availability of large amounts of annotated text data, which has allowed for the training of more accurate models. Additionally, the development of unsupervised and semi-supervised learning techniques has allowed for the training of models on smaller amounts of labeled data, making it possible to apply NLP in a wider range of scenarios.
Language models have had a significant impact on the field of NLP. One of the key ways that language models have changed the field is by improving the accuracy and effectiveness of natural language processing tasks. For example, many language models have been trained on large amounts of text data, allowing them to better understand the nuances and complexities of human language. This has led to improved performance in tasks such as language translation, text summarization, and sentiment analysis.
Another way that language models have changed the field of NLP is by enabling the development of more advanced, sophisticated NLP systems. For example, some language models, such as GPT, can generate human-like text, which has opened up new possibilities for natural language generation and dialogue systems. Other language models, such as BERT, have improved the performance of tasks such as question answering, sentiment analysis, and named entity recognition.
Language models have also changed the field by making it more accessible to a broader range of people. With the advent of pre-trained language models, developers can now easily fine-tune these models to specific tasks without the need for large amounts of labeled data or the expertise to train models from scratch. This has made it easier for developers to build NLP applications and has led to an explosion of new NLP-based products and services.
Overall, language models have played a key role in advancing the field of NLP by improving the performance of existing NLP tasks, enabling the development of more advanced NLP systems, and making NLP more accessible to a broader range of people.
ChatGPT, a variant of the GPT model, has become popular because of its ability to generate human-like text, which can be used for a broad range of natural language generation tasks, such as chatbot systems, text summarization, and dialogue systems.
The main reason for its popularity is its high-quality outputs and its ability to generate text that is hard to distinguish from text written by humans. This makes it well-suited for applications that require natural-sounding text, such as chatbot systems, virtual assistants, and text summarization.
Additionally, ChatGPT is pre-trained on a large amount of text data, allowing it to understand human language nuances and complexities. This makes it well-suited for applications that require a deep understanding of language, such as question answering and sentiment analysis.
Moreover, ChatGPT can be fine-tuned for specific use cases by providing it with a small amount of task-specific data, which makes it versatile and adaptable to a wide range of applications. It is widely used in industry, research, and personal projects, in applications including customer service chatbots, virtual assistants, automated content creation, text summarization, dialogue systems, question answering, and sentiment analysis.
Overall, ChatGPT’s ability to generate high-quality, human-like text and its ability to be fine-tuned for specific tasks makes it a popular choice for a wide range of natural language generation applications.
Let’s move on to summarize the chapter now.
In this chapter, we introduced you to the field of NLP, which is a subfield of AI. The chapter highlighted the importance of mathematical foundations, such as linear algebra, statistics and probability, and optimization theory, which are necessary to understand the algorithms used in NLP. It also covered the challenges faced in NLP, such as understanding the context and meaning of words, the relationships between them, and the need for labeled data. We discussed the recent advancements in NLP, including pre-trained language models, such as BERT and GPT, and the availability of large amounts of text data, which has led to improved performance in NLP tasks. We touched on the importance of text preprocessing, including the roles of data cleaning, data normalization, stemming, and lemmatization. We then discussed how the convergence of NLP and ML is driving advancements in the field and becoming an increasingly important tool for automating tasks and improving human-computer interaction.
After learning from this chapter, you will be able to understand the importance of NLP, ML, and DL techniques. You will understand the recent advancements in NLP, including pre-trained language models. You will also have gained knowledge of the importance of text preprocessing and how it plays a crucial role in data preparation and data cleaning for NLP tasks.
In the next chapter, we will cover the mathematical foundations of ML. These foundations will serve us throughout the book.
Natural language processing (NLP) and machine learning (ML) are two fields that have significantly benefited from mathematical concepts, particularly linear algebra and probability theory. These fundamental tools enable the analysis of the relationships between variables, forming the basis of many NLP and ML models. This chapter provides a comprehensive introduction to linear algebra and probability theory, including their practical applications in NLP and ML. The chapter commences with an overview of vectors and matrices and covers essential operations. Additionally, the basics of statistics, required for understanding the concepts and models in subsequent chapters, will be explained. Finally, the chapter introduces the fundamentals of optimization, which are critical for solving NLP problems and understanding the relationships between variables. By the end of this chapter, you will have a solid foundation in linear algebra and probability theory and understand their essential applications in NLP and ML.
In this chapter, we’ll be covering the following topics:
Let’s start by first understanding scalars, vectors, and matrices:
Let’s move on to the basic operations for scalars, vectors, and matrices next.
The basic operations of addition and subtraction can be carried out on vectors with the same dimensions. Let's have two vectors, x = [x₁, …, xₙ] and y = [y₁, …, yₙ]; their sum is computed element-wise:

x + y = [x₁ + y₁, …, xₙ + yₙ]
For example, if we have two vectors, a = [4,1] and b = [2,4], then a + b = [6,5].
Let’s visualize this as follows:
Figure 2.1 – Adding two vectors (a = [4,1] and b = [2,4]) means that a + b = [6,5]
It is possible to scale a vector by multiplying it by a scalar. This operation is performed by multiplying each component of the vector by the scalar value. For example, let's consider an n-dimensional vector, x = [x₁, x₂, …, xₙ]. The process of scaling this vector by a factor of α can be represented mathematically as follows:

αx = [αx₁, αx₂, …, αxₙ]
This operation results in a new vector that has the same dimensionality as the original vector but with each component multiplied by the scalar value α.
There are two types of multiplication between vectors: the dot product (·) and the cross product (×). The dot product is the one we use most often in ML algorithms.
The dot product is a mathematical operation that can be applied to two vectors, x = [x₁, x₂, …, xₙ] and y = [y₁, y₂, …, yₙ]. It has many practical applications, one of which is to help determine their similarity. It is defined as the sum of the products of the corresponding elements of the two vectors. The dot product of x and y is represented by the symbol x · y and is defined as follows:

x · y = x₁y₁ + x₂y₂ + … + xₙyₙ
where n represents the dimensionality of the vectors. The dot product is a scalar quantity and can be used to measure the angle between two vectors, as well as the projection of one vector onto another. It also serves a vital function in numerous ML algorithms, including linear regression and neural networks.
The dot product is commutative, meaning that the order of the vectors does not affect the result. This means that x · y = y · x. Furthermore, the dot product is distributive over vector addition, implying the following:

x · (y + z) = x · y + x · z
The dot product of a vector with itself is its squared norm, the square of its Euclidean norm. The norm, symbolized by norm(x), signifies the length of the vector and is computed as

norm(x) = √(x · x) = √(x₁² + x₂² + … + xₙ²)
The normalization of vectors can be achieved by dividing them by their norm, also known as the Euclidean norm or the length of the vector. This results in a vector with a unit length, denoted by x′. The normalization process can be shown as

x′ = x / norm(x)
where x is the original vector and norm(x) represents its norm. It should be noted that normalizing a vector has the effect of retaining its direction while setting its length to 1, allowing the meaningful comparison of vectors in different spaces.
The cosine similarity between two vectors x and y is mathematically represented as the dot product of the two vectors after they have been normalized to unit length. This can be written as follows:

cos(θ) = (x · y) / (norm(x) norm(y))
where norm(x) and norm(y) are the norms of the vectors x and y, respectively. This computed cosine similarity between x and y is equivalent to the cosine of the angle between the two vectors, denoted as θ.
Vectors with a dot product of 0 are deemed orthogonal, implying that in the case of two non-zero vectors, the angle between them is 90 degrees. We can conclude that the zero vector is orthogonal to any vector. A group of vectors is considered orthonormal if each pair of them is orthogonal and each vector possesses a norm of 1. Such orthonormal sets prove to be valuable in numerous mathematical contexts. For instance, they come into play when transforming between different orthogonal co-ordinate systems, where the new co-ordinates of a point are computed in relation to the modified direction set. This approach, known as co-ordinate transformation in the field of analytical geometry, finds widespread application in the realm of linear algebra.
Matrix transpose is the process of obtaining the transpose of a matrix and involves interchanging its rows and columns. This means that the element originally at the (i, j)th position in the matrix occupies the (j, i)th position in its transpose. As a result, a matrix that was originally of size n × m becomes an m × n matrix when transposed. The notation used to represent the transpose of matrix X is Xᵀ. Here's an illustrative example of a matrix transposition operation:

X = [[1, 2, 3],
     [4, 5, 6]],   Xᵀ = [[1, 4],
                         [2, 5],
                         [3, 6]]
Crucially, the transpose of Xᵀ reverts to the original matrix X, that is, (Xᵀ)ᵀ = X. Moreover, it is clear that row vectors can be transposed into column vectors and vice versa. Additionally, the following holds true for both matrices and vectors:

(A + B)ᵀ = Aᵀ + Bᵀ,   (AB)ᵀ = BᵀAᵀ
It's also noteworthy that dot products are commutative for vectors:

xᵀy = yᵀx
In this section, we'll cover different types of matrix definitions:
A square matrix A that satisfies A = Aᵀ is symmetric.
The determinant of a square matrix provides a notion of its impact on the volume of a d-dimensional object when multiplied by its co-ordinate vectors. The determinant, symbolized as det(A), represents the (signed) volume of the parallelepiped formed by the row or column vectors of the matrix. This interpretation holds consistently, as the volume determined by the row and column vectors is mathematically identical. When a diagonalizable matrix A interacts with a group of co-ordinate vectors, the ensuing distortion is termed anisotropic scaling. The determinant can aid in establishing the scale factors of this conversion. The determinant of a square matrix carries crucial insights about the linear transformation accomplished by multiplication with the matrix. In particular, the sign of the determinant reflects whether the transformation preserves or reverses the orientation of the system's basis.
The determinant can be calculated recursively by cofactor (Laplace) expansion:

det(A) = Σᵢ₌₁ᵈ (−1)^(i+j) aᵢⱼ det(Aᵢⱼ)

with j as a fixed value ranging from 1 to d, or, with a fixed i,

det(A) = Σⱼ₌₁ᵈ (−1)^(i+j) aᵢⱼ det(Aᵢⱼ)

where Aᵢⱼ denotes the (d − 1) × (d − 1) submatrix obtained by deleting row i and column j of A.
Based on these formulas, we can see that some cases can be easily calculated:
For a 2 × 2 matrix

A = [[a, b],
     [c, d]]

its determinant can be computed as ad − bc. If we consider a 3 × 3 matrix,

A = [[a, b, c],
     [d, e, f],
     [g, h, i]]

the determinant is calculated as follows:

det(A) = a(ei − fh) − b(di − fg) + c(dh − eg)
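The closed-form expressions can be checked against NumPy's determinant routine; the numeric matrices below are illustrative examples.

```python
import numpy as np

# 2 x 2 case: ad - bc = 1*4 - 2*3 = -2.
A = np.array([[1.0, 2.0],
              [3.0, 4.0]])
print(np.linalg.det(A))

# 3 x 3 case: a(ei - fh) - b(di - fg) + c(dh - eg)
#           = 2*(3 - 2) - 0*(1 - 2) + 1*(1 - 3) = 0.
B = np.array([[2.0, 0.0, 1.0],
              [1.0, 3.0, 2.0],
              [1.0, 1.0, 1.0]])
print(np.linalg.det(B))
```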
Let’s now move on to eigenvalues and vectors.
A vector x, belonging to a d × d matrix A, is an eigenvector if it satisfies the equation Ax = λx, where λ represents the eigenvalue associated with the matrix. This relationship delineates the link between matrix A and its corresponding eigenvector x, which can be perceived as the "stretching direction" of the matrix. In the case where A is a matrix that can be diagonalized, it can be deconstructed into a d × d invertible matrix, V, and a diagonal d × d matrix, Δ, such that

A = VΔV⁻¹
The columns of V encompass d eigenvectors, while the diagonal entries of Δ house the corresponding eigenvalues. The linear transformation Ax can be visually understood through a sequence of three operations. Initially, the multiplication of x by V⁻¹ calculates x's co-ordinates in a non-orthogonal basis associated with V's columns. Subsequently, the multiplication by Δ scales these co-ordinates using the factors in Δ, aligned with the eigenvectors' directions. Finally, the multiplication with V restores the co-ordinates to the original basis, resulting in an anisotropic scaling along the d eigenvector directions.
Diagonalizable matrices signify transformations involving anisotropic scaling along d linearly independent directions. When V's columns are orthonormal vectors, V⁻¹ equals its transpose, Vᵀ, indicating scaling along mutually orthogonal directions. In such cases, matrix A is always diagonalizable and exhibits symmetry, as affirmed by the following relationship:

Aᵀ = (VΔVᵀ)ᵀ = VΔVᵀ = A
The conventional method to ascertain the eigenvectors of a d × d matrix A involves locating the d roots of the characteristic equation:

det(A − λI) = 0
Some of these roots might be repeated. The subsequent step involves solving linear systems of the form (A − λI)x = 0, typically achieved using the Gaussian elimination method. However, this method might not always be the most stable or precise, as solvers of polynomial equations can exhibit ill-conditioning and numerical instability in practical applications. Indeed, a prevalent technique for resolving high-degree polynomial equations in engineering involves constructing a companion matrix possessing the same characteristic polynomial as the original polynomial and then determining its eigenvalues.
Eigenvalue decomposition, also known as the eigen-decomposition or the diagonalization of a matrix, is a powerful mathematical tool used in linear algebra and computational mathematics. The goal of eigenvalue decomposition is to decompose a given matrix into a product of matrices that represent the eigenvectors and eigenvalues of the matrix.
The eigenvalue decomposition of matrix A is a factorization of the matrix into a product involving the matrix V and the diagonal matrix D, namely A = VDV⁻¹.
The columns of V are the eigenvectors of matrix A, and D is a diagonal matrix that contains the corresponding eigenvalues on its diagonal.
The eigenvalue problem is to find the non-zero vectors, v, and the scalars, λ, such that Av = λv, where A is a square matrix, and thus v is an eigenvector of A. The scalar λ is called an eigenvalue of matrix A. The eigenvalue problem can be written in matrix form as Av = λIv, where I is the identity matrix.
The process of determining eigenvalues is intimately linked to the characteristic equation of matrix A, which is the polynomial equation derived from det(A − λI) = 0. The characteristic equation can be solved for the eigenvalues, λ, which are the roots of the equation. Once the eigenvalues are found, the eigenvectors can be found by solving the system of linear equations (A − λI)v = 0.
One important property of eigenvalue decomposition is that it allows us to diagonalize a matrix, which means that we can transform the matrix into a diagonal form by using an appropriate matrix of eigenvectors. The diagonal form of a matrix is useful because it allows us to calculate the trace and determinant of the matrix easily.
Another important property of eigenvalue decomposition is that it provides insight into the structure of the matrix. For example, the eigenvalues of a symmetric matrix are always real, and the eigenvectors are orthogonal, which means that they are perpendicular to each other. In the case of non-symmetric matrices, the eigenvalues can be complex, and the eigenvectors are not necessarily orthogonal.
The eigenvalue decomposition of a matrix has many applications in mathematics, physics, engineering, and computer science. In numerical analysis, eigenvalue decomposition is used to find the solution of linear systems, compute the eigenvalues of a matrix, and find the eigenvectors of a matrix. In physics, eigenvalue decomposition is used to analyze the stability of systems, such as the stability of an equilibrium point in a differential equation. In engineering, eigenvalue decomposition is used to study the dynamics of systems, such as the vibrations of a mechanical system.
Within the field of computer science, eigenvalue decomposition finds versatile applications across various domains, including machine learning and data analysis. In machine learning, eigenvalue decomposition plays a pivotal role in enabling principal component analysis (PCA), a technique employed for dimensionality reduction in extensive datasets. In the realm of data analysis, eigenvalue decomposition is harnessed to calculate the singular value decomposition (SVD), a potent tool for dissecting and understanding complex datasets.
The problem of minimizing xᵀAx, where x is a column vector that has a unit norm, and A is a symmetric d × d data matrix, is a typical problem encountered in numerous machine learning contexts. This problem type is often found in applications such as principal component analysis, singular value decomposition, and spectral clustering, all of which involve feature engineering and dimensionality reduction. The optimization problem can be articulated as follows:
Minimize xᵀAx
Subject to ||x|| = 1
We can solve the optimization problem in a maximization or minimization form. Imposing the constraint that vector x must be a unit vector significantly changes the nature of the optimization problem. In contrast to the prior section, the positive semi-definiteness of matrix A is no longer crucial for determining the solution. Even when A is indefinite, the constraint on the norm of vector x ensures a well-defined solution, preventing the involvement of vectors with unbounded magnitudes or trivial solutions, such as the 0 vector. Singular value decomposition (SVD) is a mathematical technique that takes a rectangular matrix, A, and decomposes it into three matrices: U, S, and Vᵀ. Matrix A is defined as an n × p matrix. The theorem of SVD states that A can be represented as the product of three matrices, A = USVᵀ, where U is n × n, S is n × p, V is p × p, and U and V are orthogonal matrices.
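As a sketch of the constrained problem, the following uses NumPy to show that, for a symmetric matrix (here an indefinite one of our own choosing), the minimum of xᵀAx over unit vectors is the smallest eigenvalue, attained at the corresponding eigenvector:

```python
import numpy as np

# Symmetric but indefinite matrix: one positive and one negative eigenvalue.
A = np.array([[2.0, 3.0],
              [3.0, -1.0]])

# For symmetric A, the minimum of x^T A x over unit vectors is the smallest
# eigenvalue, attained at its eigenvector (np.linalg.eigh returns
# eigenvalues in ascending order).
eigenvalues, V = np.linalg.eigh(A)
x_star = V[:, 0]
min_value = x_star @ A @ x_star
print(np.isclose(min_value, eigenvalues[0]))  # True

# Compare against random unit vectors: none should beat the eigenvector.
rng = np.random.default_rng(0)
samples = rng.normal(size=(1000, 2))
samples /= np.linalg.norm(samples, axis=1, keepdims=True)
values = np.einsum('ij,jk,ik->i', samples, A, samples)
print(values.min() >= eigenvalues[0] - 1e-9)  # True
```

Note that even though A is indefinite, the unit-norm constraint keeps the problem well posed, exactly as the text argues.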
The U matrix’s columns are known as the left singular vectors, while the rows of the transpose of the V matrix are the right singular vectors. The S matrix, with singular values, is a diagonal matrix of the same size as A. SVD decomposes the original data into a co-ordinate system where the defining vectors are orthonormal (both orthogonal and normal). SVD computation involves identifying the eigenvalues and eigenvectors of the matrices AᵀA and AAᵀ. Matrix V’s columns consist of eigenvectors from AᵀA, and matrix U’s columns consist of eigenvectors from AAᵀ. Singular values in the S matrix are derived from the square roots of the eigenvalues of either AᵀA or AAᵀ, organized in decreasing order. These singular values are real numbers. If A is a real matrix, U and V will also be real.
To illustrate the calculation of SVD, an example is provided. Consider a 4 × 2 matrix. The eigenvalues can be found by computing AᵀA and AAᵀ and then determining the eigenvectors of these matrices. U’s columns are formed by the eigenvectors of AAᵀ, and V’s columns are formed by the eigenvectors of AᵀA. The S matrix comprises the square roots of the eigenvalues from either AᵀA or AAᵀ. Eigenvalues are found by solving the characteristic equation det(W − λI) = 0, where W is the matrix, I is the identity matrix, and λ is the eigenvalue. The eigenvectors are then found by solving the set of equations derived from the eigenvalue equations. The final matrices U, S, and Vᵀ are then obtained by combining the eigenvectors and singular values.
It should be noted that the singular values are in descending order, with σ₁ ≥ σ₂ ≥ … ≥ 0.
Let’s now move on to basic probability for machine learning.
Probability provides information about the likelihood of an event occurring. In this field, there are several key terms that are important to understand:
Therefore, in technical terms, probability is a measure of the likelihood of an event occurring when an experiment is conducted.
In this very simple case, the probability of event A with one outcome is equal to the chance of event A divided by the chance of all possible events. For example, in flipping a fair coin, there are two outcomes with the same chance: heads and tails. The chance of having heads will be 1/(1+1) = ½.
In order to calculate the probability, given an event, A, with n outcomes and a sample space, S, the probability of event A is calculated as
P(A) = P(x₁) + P(x₂) + … + P(xₙ)
where xᵢ represents the outcomes in A. Assuming all results of the experiment have equal probability, and the selection of one does not influence the selection of others in subsequent rounds (meaning they are statistically independent), then
P(A) = n / |S|
where |S| is the number of outcomes in the sample space.
Hence, the value of probability ranges from 0 to 1, with the sample space embodying the complete set of potential outcomes, denoted as P(S) = 1.
In the realm of statistics, two events are defined as independent if the occurrence of one event doesn’t influence the likelihood of the other event’s occurrence. To put it formally, events A and B are independent precisely when P(A and B) = P(A)P(B), where P(A) and P(B) are the respective probabilities of events A and B happening.
Consider this example to clarify the concept of statistical independence: imagine we possess two coins, one fair (an equal chance of turning up heads or tails) and the other biased (showing a head is more likely than a tail). If we flip the fair coin and the biased coin, these two events are statistically independent because the outcome of one coin flip doesn’t alter the probability of the other coin turning up heads or tails. Specifically, the likelihood of both coins showing heads is the product of the individual probabilities: (1/2) * (3/4) = 3/8.
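A small simulation can illustrate this: over many flips of the two coins, the empirical frequency of both landing heads approaches the product of the individual probabilities, 3/8 (the simulation itself is our own sketch, not from the text):

```python
import random

random.seed(42)
TRIALS = 100_000

# Fair coin: P(heads) = 1/2.  Biased coin: P(heads) = 3/4.
both_heads = 0
for _ in range(TRIALS):
    fair = random.random() < 0.5
    biased = random.random() < 0.75
    if fair and biased:
        both_heads += 1

estimate = both_heads / TRIALS
print(abs(estimate - 3 / 8) < 0.01)  # the estimate is close to the analytic 3/8
```

Because the flips are generated independently, the joint frequency factorizes into the product P(A)P(B), which is exactly the definition of statistical independence.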
Statistical independence is a pivotal concept in statistics and probability theory, frequently leveraged in machine learning to outline the connections between variables within a dataset. By comprehending these relationships, machine learning algorithms can better spot patterns and deliver more precise predictions. We will describe the relationship between different types of events in the following:
Next, we are going to describe the discrete random variable, its distribution, and how to use it to calculate the probabilities.
A discrete random variable refers to a variable that can assume a finite or countably infinite number of potential outcomes. Examples of such variables might be the count of heads resulting from a coin toss, the tally of cars crossing a toll booth within a specific time span, or the number of blonde-haired students in a classroom.
The probability distribution of a discrete random variable assigns a certain likelihood to each potential outcome the variable could adopt. For instance, in the case of a coin toss, the probability distribution assigns a 0.5 probability to both 0 and 1, representing tails and heads, respectively. For the car toll booth scenario, the distribution could assign a probability of 0.1 to no cars passing, 0.3 to one car, 0.4 to two cars, 0.15 to three cars, and 0.05 to four or more cars.
A graphical representation of the probability distribution of a discrete random variable can be achieved through a probability mass function (PMF), which correlates each possible outcome of the variable to its likelihood of occurrence. This function is usually represented as a bar chart or histogram, with each bar signifying the probability of a specific value.
The PMF is bound by two key principles:
The expected value of a discrete random variable offers an insight into its central tendency, computed as the probability-weighted average of its possible outcomes. This expected value is signified as E[X], with X representing the random variable.
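Using the toll-booth distribution above, the expected value can be sketched in a few lines (to keep the sketch finite, "four or more cars" is treated here as exactly four):

```python
# PMF from the toll-booth example; "four or more cars" is
# simplified to exactly 4 for this sketch.
pmf = {0: 0.10, 1: 0.30, 2: 0.40, 3: 0.15, 4: 0.05}

# A valid PMF: probabilities are non-negative and sum to 1.
assert all(p >= 0 for p in pmf.values())
assert abs(sum(pmf.values()) - 1.0) < 1e-9

# Expected value: the probability-weighted average of the outcomes.
expected = sum(x * p for x, p in pmf.items())
print(round(expected, 2))  # 1.75
```

The two assertions are precisely the two key principles a PMF must obey, and E[X] = 1.75 cars is the distribution's central tendency.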
The probability density function (PDF) is a tool used to describe the distribution of a continuous random variable. It can be used to calculate the probability of a value falling within a specific range. In simpler terms, it helps determine the chances of a continuous variable, X, having a value within the interval [a, b], or in statistical terms, P(a ≤ X ≤ b).
For continuous variables, the probability of a single value occurring is always 0, which is in contrast to discrete variables that can assign nonzero probabilities to distinct values. PDFs provide a way to estimate the likelihood of a value falling within a given range instead of at a single value.
For example, you can use a PDF to find the chances of the next IQ score measured falling between 100 and 120.
Figure 2.2 – Probability density function for IQ from 100–120
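As a sketch, assuming IQ scores follow a normal distribution with mean 100 and standard deviation 15 (a common modeling choice, which is our assumption rather than something stated in the figure), this probability can be computed from the normal CDF using only the standard library:

```python
import math

def normal_cdf(x, mean=100.0, sd=15.0):
    """CDF of a normal distribution, expressed via the error function."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))

# P(100 <= IQ <= 120) is the area under the PDF between the bounds,
# i.e. the difference of the CDF at the two endpoints.
p = normal_cdf(120) - normal_cdf(100)
print(round(p, 3))  # ~0.409
```

Roughly 41% of scores fall in that band under this model, which is the shaded area a figure like Figure 2.2 depicts.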
To ascertain the distribution of a discrete random variable, one can either provide its PMF or cumulative distribution function (CDF). For continuous random variables, we primarily utilize the CDF, as it is well established. However, the PMF is not suitable for these types of variables because P(X = x) equals 0 for all x in the set of real numbers, given that X can assume any real value between a and b. Therefore, we typically define the PDF instead. The PDF resembles the concept of mass density in physics, signifying the concentration of probability. Its unit is the probability per unit length. To get a grasp of the PDF, let’s analyze a continuous random variable, X, and establish the function fX(x) as follows:
fX(x) = lim Δ→0⁺ P(x < X ≤ x + Δ) / Δ
if the limit exists.
The function provides the probability density at a given point, x. This is equivalent to the limit of the ratio of the probability of the interval (x, x + Δ] to the length of the interval as that length approaches 0.
Let’s contemplate a continuous random variable, X, possessing an absolutely continuous CDF, denoted as FX(x). If FX(x) is differentiable at x, the function fX(x) is referred to as the PDF of X:
fX(x) = dFX(x) / dx
assuming FX(x) is differentiable at x.
For example, let’s consider a continuous uniform random variable, X, with uniform distribution on the interval [a, b]. Its CDF is given by:
FX(x) = (x − a) / (b − a) for a ≤ x ≤ b
which is 0 for any x below the bounds and 1 for any x above them.
By using integration, the CDF can be obtained from the PDF:
FX(x) = ∫_{−∞}^{x} fX(t) dt
Additionally, we have
P(a < X ≤ b) = FX(b) − FX(a) = ∫_{a}^{b} fX(t) dt
So, if we integrate over the entire real line, we will get 1:
∫_{−∞}^{∞} fX(t) dt = 1
Explicitly, when integrating the PDF across the entire real number line, the result should equal 1. This signifies that the area beneath the PDF curve must equate to 1, or P(S) = 1, which remains true for the uniform distribution. The PDF signifies the density of probability; thus, it must be non-negative and can exceed 1.
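A minimal numeric check of that last point: a uniform density on [0, 0.5] has height 2, exceeding 1 everywhere on its support, yet its integral over the real line is still 1 (the midpoint-rule integration is our own sketch):

```python
# Uniform density on [0, 0.5]: f(x) = 2 on the interval, 0 elsewhere.
# The density exceeds 1, yet the total area is still exactly 1.
a, b = 0.0, 0.5

def pdf(x):
    return 1.0 / (b - a) if a <= x <= b else 0.0

# Simple midpoint-rule integration over a range wider than [a, b].
n = 100_000
lo, hi = -1.0, 1.0
width = (hi - lo) / n
area = sum(pdf(lo + (i + 0.5) * width) * width for i in range(n))
print(round(area, 6))  # 1.0
```

This makes the distinction concrete: a density value is not a probability, so it may exceed 1, but the area under the curve must integrate to 1.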
Consider a continuous random variable, X, with PDF represented as fX(x). The ensuing properties are applicable:
fX(x) ≥ 0 for all real x
Next, we’ll move on to cover maximum likelihood.
Maximum likelihood is a statistical approach used to estimate the parameters of a probability distribution. The objective is to identify the parameter values that maximize the likelihood of observing the data, essentially determining the parameters most likely to have generated the data.
Suppose we have a random sample, X = (x₁, x₂, …, xₙ), from a population with a probability distribution, P(x|θ), where θ is a vector of parameters. The likelihood of observing the sample, X, given the parameters, θ, is defined as the product of the individual probabilities of observing each data point:
L(θ; X) = P(x₁|θ) P(x₂|θ) … P(xₙ|θ)
In the case of independent and identically distributed observations, the likelihood function can be expressed as the product of the univariate density functions, each evaluated at the corresponding observation:
L(θ; X) = ∏_{i=1}^{n} f(xᵢ; θ)
The maximum likelihood estimate (MLE) is the parameter vector value that offers the maximum value for the likelihood function across the parameter space.
In many cases, it’s more convenient to employ the natural logarithm of the likelihood function, referred to as the log-likelihood. The peak of the log-likelihood happens at the identical parameter vector value as the likelihood function’s maximum, and the conditions required for a maximum (or minimum) are acquired by equating the log-likelihood derivatives with respect to each parameter to 0. If the log-likelihood is differentiable with respect to the parameters, these conditions result in a set of equations that can be solved numerically to derive the MLE. One common use case or scenario where MLE significantly impacts ML model performance is in linear regression. When building a linear regression model, MLE is often used to estimate the coefficients that define the relationship between input features and the target variable. MLE helps find the values for the coefficients that maximize the likelihood of observing the given data under the assumed linear regression model, improving the accuracy of the predictions.
The MLEs of the parameters, θ, are the values that maximize the likelihood function. In other words, the MLEs are the values of θ that make the observed data, X, most probable.
To find the MLEs, we typically take the natural logarithm of the likelihood function, as it is often easier to work with the logarithm of a product than with the product itself:
ℓ(θ) = log L(θ; X) = ∑_{i=1}^{n} log P(xᵢ|θ)
The MLEs are determined by equating the partial derivatives of the log-likelihood function with respect to each parameter to 0 and then solving these equations for the parameters:
∂ℓ(θ)/∂θ₁ = 0
…
∂ℓ(θ)/∂θₖ = 0
where k is the number of parameters in θ. The goal of a maximum likelihood estimator is to find θ̂ such that
θ̂ = arg maxθ L(θ; X)
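A small sketch of the procedure for a Bernoulli model (the toy data is our own): the log-likelihood is scanned over a grid of candidate parameters, and the maximizer matches the closed-form MLE, the sample mean:

```python
import math

# Observed coin flips: 1 = heads.  The Bernoulli MLE for p is the
# sample mean; here we confirm it by scanning the log-likelihood.
data = [1, 0, 1, 1, 0, 1, 1, 0, 1, 1]  # 7 heads in 10 flips

def log_likelihood(p, xs):
    return sum(math.log(p) if x == 1 else math.log(1 - p) for x in xs)

# Grid search over candidate parameter values in (0, 1).
grid = [i / 1000 for i in range(1, 1000)]
p_hat = max(grid, key=lambda p: log_likelihood(p, data))

print(p_hat)                  # 0.7
print(sum(data) / len(data))  # 0.7, the closed-form MLE
```

Setting the derivative of 7 log p + 3 log(1 − p) to 0 gives p = 7/10 analytically, which is exactly where the grid search lands.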
Once the MLEs have been found, they can be used to make predictions about the population based on the sample data. Maximum likelihood is widely used in many fields, including psychology, economics, engineering, and biology. It serves as a potent tool for comprehending the connections among variables and for predicting outcomes based on observed data. One such application, which we develop next, is building a word predictor using maximum likelihood estimation.
Next, we introduce the problem of word autocompletion, also known as word prediction, which is a feature in which an application predicts the next word a user is typing. The aim of word prediction is to save time and make typing easier by predicting what the user is likely to type next based on their previous inputs and other contextual factors. Word prediction can be found in various forms in many applications, including search engines, text editors, and mobile device keyboards, and is designed to save time and increase the accuracy of inputs.
Given a group of words that the user typed, how would we suggest the next word?
If the words were The United States of, then it would be trivial to assume that the next word would be America. However, what about finding the next word for How are? One could suggest several next words.
There usually isn’t just one clear next word. Thus, we’d want to suggest the most likely word or perhaps even the most likely words. In that case, we would be interested in suggesting a probabilistic representation of the possible next words and picking the next word as the one that is most probable.
The maximum likelihood estimator provides us with that precise capability. It can tell us which word is most probable given the previous words that the user typed.
In order to calculate the MLE, we need to calculate the probability function of all word combinations. We can do that by processing large texts and counting how many times each combination of words exists.
Consider reviewing a large cohort of text that has the following occurrences:
| | “you” | “they” | “those” | “the” | Any other word |
| --- | --- | --- | --- | --- | --- |
| “how are …” | 16 | 14 | 0 | 100 | 10 |
| not “how are …” | 200 | 100 | 300 | 1,000 | 30,000 |

Table 2.1 – Sample of n-gram occurrences in a document
For instance, there are 16 occurrences in the text where the sequence “how are you” appears. There are 140 sequences that have a length of three and that start with the words “how are.” That is calculated as:
16 + 14 + 0 + 100 + 10 = 140
There are 216 sequences that have a length of three and that end with the word “you.” That is calculated as:
16 + 200 = 216
Now, let’s suggest a formula for the most likely next word.
Based on the common maximum likelihood estimation for the probabilistic variable w₃ (the next word), the formula would be to find a value for w₃ which maximizes:
P(w₃ | w₁, w₂)
However, this common formula has a few characteristics that wouldn’t be advantageous to our application.
Consider the next formula, which has specific advantages that are necessary for our use case. It is the maximum likelihood formula for parametric estimation, meaning estimating deterministic parameters. It suggests finding a value for w₃ which maximizes:
P(w₁, w₂ | w₃)
w₃ is by no means a deterministic parameter; however, this formula suits our use case, as it reduces common-word bias, emphasizing contextual fit, and adjusts for word specificity, thus enhancing the relevance of our predictions. We will elaborate more on these traits in the conclusion of this exercise.
Let’s enhance this formula so as to make it easier to calculate:
P(w₁, w₂ | w₃) = count(w₁ w₂ w₃) / count(· · w₃)
where count(· · w₃) is the number of length-three sequences that end with w₃.
In our case, w₁ is “how” and w₂ is “are.”
There are five candidates for the next word; let’s calculate the probability for each of them:
P(“how are” | “you”) = 16 / (16 + 200) = 16/216
P(“how are” | “they”) = 14 / (14 + 100) = 7/57
P(“how are” | “those”) = 0 / (0 + 300) = 0
P(“how are” | “the”) = 100 / (100 + 1,000) = 1/11
P(“how are” | any other word) = 10 / (10 + 30,000) = 1/3,001
Out of all the options, the highest value of probability is 7/57, and it is achieved when “they” is the next word.
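The calculation can be reproduced in code; the counts come straight from Table 2.1, with "other" standing in for any other word:

```python
# Counts from Table 2.1.  Row 1: sequences "how are <w>";
# row 2: all other length-three sequences ending in <w>.
counts_how_are = {"you": 16, "they": 14, "those": 0, "the": 100, "other": 10}
counts_other   = {"you": 200, "they": 100, "those": 300, "the": 1000, "other": 30000}

# P("how are" | w) = count("how are" w) / count(* * w): the chosen
# parametric-MLE formula, which normalizes by each word's overall frequency.
def likelihood(word):
    total_ending_in_word = counts_how_are[word] + counts_other[word]
    if total_ending_in_word == 0:
        return 0.0
    return counts_how_are[word] / total_ending_in_word

scores = {w: likelihood(w) for w in counts_how_are}
best = max(scores, key=scores.get)
print(best)                      # they
print(round(scores["they"], 4))  # 7/57 ~= 0.1228
```

Note how "the" loses despite its raw count of 100: dividing by its huge overall frequency is exactly the common-word correction discussed next.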
Note that the intuition behind this maximum likelihood estimator is having the suggested next word make the words that the user typed most likely. One could wonder, why not take the word that is most probable given the first two words, meaning, the original maximum likelihood formula for probabilistic variables? From the table, we see that given the words “how are,” the most frequent third word is “the,” with a probability of 100/140. However, this approach wouldn’t take into account the fact that the word “the” is extremely prevalent altogether, as it is among the most frequently used words in the text in general. Thus, its high frequency isn’t due to its relationship to the first two words; it is because it is simply a very common word in general. The maximum likelihood formula we chose takes that into account.
Bayesian estimation is a statistical approach that involves updating our beliefs or probabilities about a quantity of interest based on new data. The term “Bayesian” refers to Thomas Bayes, an 18th-century statistician who first developed the concept of Bayesian probability.
In Bayesian estimation, we start with prior beliefs about the quantity of interest, which are expressed as a probability distribution. These prior beliefs are updated as we collect new data. The updated beliefs are represented as a posterior distribution. The Bayesian framework provides a systematic way of updating prior beliefs with new data, taking into account the degree of uncertainty in both the prior beliefs and the new data.
The posterior distribution is calculated using Bayes’ theorem, which is the fundamental equation of Bayesian estimation. Bayes’ theorem states that
P(θ|X) = P(X|θ) P(θ) / P(X)
where θ is the quantity of interest, X is the new data, P(θ|X) is the posterior distribution, P(X|θ) is the likelihood of the data given the parameter value, P(θ) is the prior distribution, and P(X) is the marginal likelihood or evidence.
The marginal likelihood is calculated as follows:
P(X) = ∫ P(X|θ) P(θ) dθ
where the integral is taken over the entire space of θ. The marginal likelihood is often used as a normalizing constant, ensuring that the posterior distribution integrates to 1.
In Bayesian estimation, the choice of prior distribution is important, as it reflects our beliefs about the quantity of interest before collecting any data. The prior distribution can be chosen based on prior knowledge or previous studies. If no prior knowledge is available, a non-informative prior can be used, such as a uniform distribution.
Once the posterior distribution is calculated, it can be used to make predictions about the quantity of interest. As an example, the posterior distribution’s mean can serve as a point estimate, whereas the posterior distribution itself can be employed to establish credible intervals. These intervals represent the probable range within which the true value of the target quantity resides.
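As a concrete sketch (the coin example and prior are our own), a Beta prior on a coin's heads probability updated with binomial data yields a Beta posterior, whose mean serves as the point estimate:

```python
# Beta-Binomial conjugate update: a Beta(alpha, beta) prior on the
# heads probability of a coin, updated with observed flips.
alpha_prior, beta_prior = 2.0, 2.0  # weak prior centered on 0.5
heads, tails = 7, 3                 # new data: 7 heads in 10 flips

# Because the Beta prior is conjugate to the Binomial likelihood,
# the posterior is again a Beta with updated parameters.
alpha_post = alpha_prior + heads
beta_post = beta_prior + tails

# The posterior mean serves as a point estimate of the heads probability.
posterior_mean = alpha_post / (alpha_post + beta_post)
print(round(posterior_mean, 3))  # 9/14 ~= 0.643
```

The estimate 0.643 sits between the prior mean (0.5) and the sample frequency (0.7), showing how the prior belief and the new data are blended.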
This chapter was about linear algebra and probability for ML, and it covers the fundamental mathematical concepts that are essential to understanding many machine learning algorithms. The chapter began with a review of linear algebra, covering topics such as matrix multiplication, determinants, eigenvectors, and eigenvalues. It then moved on to discuss probability theory, introducing the basic concepts of random variables and probability distributions. We also covered key concepts in statistical inference, such as maximum likelihood estimation and Bayesian inference.
In the next chapter, we will cover the fundamentals of machine learning for NLP, including topics such as data exploration, feature engineering, selection methods, and model training and validation.
Please find the additional reading content as follows:
H = I − 2uuᵀ
Here, I is the identity matrix, and u is a unit vector defining the reflection plane.
The main purpose of Householder transformations is to perform QR factorization and to reduce matrices to a tridiagonal or Hessenberg form. The properties of being symmetric and orthogonal make the Householder matrix computationally efficient and numerically stable.
Tr(A) = a + d.
In this chapter, we will delve into the fundamentals of machine learning (ML) and the preprocessing techniques that are essential for natural language processing (NLP) tasks. ML is a powerful tool for building models that can learn from data, and NLP is one of the most exciting and challenging applications of ML.
By the end of this chapter, you will have gained a comprehensive understanding of data exploration, preprocessing, and data splitting, know techniques for dealing with imbalanced data, and have learned about some of the common ML models required for successful ML, particularly in the context of NLP.
The following topics will be covered in this chapter:
Prior knowledge of programming languages, particularly Python, is assumed in this chapter and subsequent chapters of this book. It is also expected that you have already gone through previous chapters to become acquainted with the necessary linear algebra and statistics concepts that will be discussed in detail.
When working in a methodological environment, datasets are often well known and preprocessed, such as Kaggle datasets. However, in real-world business environments, one important task is to define the dataset from all possible sources of data, explore the gathered data to find the best method for preprocessing it, and ultimately decide on the ML and natural language models that fit the problem and the underlying data best. This process requires careful consideration and analysis of the data, as well as a thorough understanding of the business problem at hand.
In NLP, the data can be quite complex, as it often includes text and speech data that can be unstructured and difficult to analyze. This complexity makes preprocessing an essential step in preparing the data for ML models. The first step of any NLP or ML solution starts with exploring the data to learn more about it, which helps us decide on our path to tackle the problem.
Once the data has been preprocessed, the next step is to explore it to gain a better understanding of its characteristics and structure. Data exploration is an iterative process that involves visualizing and analyzing the data, looking for patterns and relationships, and identifying potential issues or outliers. This process can help us to determine which features are most important for our ML models and identify any potential biases or data quality issues. To streamline data and enhance analysis through ML models, preprocessing methods such as tokenization, stemming, and lemmatization can be employed. In this chapter, we will provide an overview of general preprocessing techniques for ML problems. In the following chapter, we will delve into preprocessing techniques specific to text processing. It is important to note that employing effective preprocessing techniques can significantly enhance the performance and accuracy of ML models, making them more robust and reliable.
Finally, once the data has been preprocessed and explored, we can start building our ML models. There is no single magical solution that works for all ML problems, so it’s important to carefully consider which models are best suited for the data and the problem at hand. Different types of NLP models exist, encompassing rule-based, statistical, and deep learning models. Each model type possesses unique strengths and weaknesses, underscoring the importance of selecting the most fitting one for the specific problem and dataset at hand.
Data exploration is an important initial step in the ML workflow that involves analyzing and understanding the data before building an ML model. The goal of data exploration is to gain insights into the data, identify patterns, detect anomalies, and prepare the data for modeling. Data exploration helps in choosing the right ML algorithm and determining the best set of features to use.
Here are some common techniques that are used in data exploration:

- Data visualization
- Data cleaning
- Feature selection
We will explore each of these techniques in the following subsections.
Data visualization is a crucial component of machine learning as it allows us to understand and explore complex datasets more easily. It involves creating visual representations of data using charts, graphs, and other types of visual aids. By visually presenting data, we can discern patterns, trends, and relationships that might not be readily evident when examining the raw data alone.
For NLP tasks, data visualization can help us gain insights into the linguistic patterns and structures in text data. For example, we can create word clouds to visualize the frequency of words in a corpus or use heatmaps to display the co-occurrence of words or phrases. We can also use scatter plots and line graphs to visualize changes in sentiment or topic over time.
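As a quick sketch of the word-frequency counts that a word cloud visualizes, we can tally tokens with Python’s standard library (the two-document corpus here is invented for illustration):

```python
from collections import Counter
import re

# A tiny invented corpus; real projects would load documents from files or a database.
corpus = [
    "the model predicts the sentiment of the review",
    "the review praises the model",
]

# Lowercase each document and tokenize on alphabetic runs, then count occurrences.
tokens = []
for doc in corpus:
    tokens.extend(re.findall(r"[a-z]+", doc.lower()))

freq = Counter(tokens)
print(freq.most_common(3))  # the most frequent tokens in the corpus
```

A library such as wordcloud would render these counts graphically; the counts alone already reveal which terms dominate the corpus.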
One common type of visualization for ML is the scatter plot, which is used to display the relationship between two variables. By plotting the values of two variables on the X and Y axes, we can identify any patterns or trends that exist between them. Scatter plots are particularly useful for identifying clusters or groups of data points that share similar characteristics.
Another type of visualization that’s frequently employed in ML is the histogram, a tool that illustrates the distribution of a single variable. By grouping data into bins and portraying the frequency of data points in each bin, we can pinpoint the range of values that predominate in the dataset. Histograms prove useful for detecting outliers or anomalies, and they aid in recognizing areas where the data may exhibit skewness or bias.
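To make the binning behind a histogram concrete, here is a minimal NumPy sketch (the document-length values are made up for illustration); `matplotlib.pyplot.hist` would draw the same counts as bars:

```python
import numpy as np

# Invented document lengths (in tokens); note the two extreme values.
doc_lengths = np.array([12, 15, 14, 80, 13, 16, 15, 14, 300, 13])

# np.histogram groups the values into bins and counts the points per bin -
# exactly the summary a histogram plot draws as bars.
counts, bin_edges = np.histogram(doc_lengths, bins=5)

print(counts)     # most documents fall in the first bin; the outliers sit apart
print(bin_edges)  # bin boundaries spanning the min and max values
```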
In addition to these basic visualizations, ML practitioners often use more advanced techniques, such as dimensionality reduction and network visualizations. Dimensionality reduction techniques, such as principal component analysis (PCA) and t-distributed stochastic neighbor embedding (t-SNE), are commonly used to project high-dimensional data into two or three dimensions so that it can be visualized and analyzed more easily. Network visualizations, on the other hand, are used to display complex relationships between entities, such as the co-occurrence of words or the connections between social media users.
Data cleaning, alternatively termed data cleansing or data scrubbing, involves recognizing and rectifying or eliminating errors, inconsistencies, and inaccuracies within a dataset. This crucial phase in data preparation for ML significantly influences the accuracy and performance of a model, relying on the quality of the data used for training. Numerous prevalent techniques are employed in data cleaning. Let’s take a closer look.
Missing data is a common problem that occurs in many machine learning projects. Dealing with missing data is important because most ML models cannot handle missing values directly and will either produce errors or provide inaccurate results.
There are several methods for dealing with missing data in ML projects:

- Deletion: remove the rows (or columns) that contain missing values
- Imputation: replace missing values with a statistic such as the mean, median, or mode
- Model-based methods: predict the missing values from the other features, or use algorithms that can handle missing values natively
In essence, selecting a method to handle missing data hinges on factors such as the nature and extent of the missing data, analysis objectives, and resource availability. It is crucial to thoughtfully assess the pros and cons of each method and opt for the most suitable approach tailored to the specific project.
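As a minimal pandas sketch of two common options, deletion and mean imputation, on an invented toy table (column names are illustrative):

```python
import numpy as np
import pandas as pd

# Toy dataset with missing entries marked as NaN.
df = pd.DataFrame({
    "age":    [25, np.nan, 31, 40, np.nan],
    "salary": [50000, 62000, np.nan, 71000, 58000],
})

# Deletion: drop every row that contains at least one missing value.
dropped = df.dropna()

# Imputation: replace missing values with the column mean.
imputed = df.fillna(df.mean(numeric_only=True))

print(len(dropped))                      # 2 complete rows survive deletion
print(int(imputed.isna().sum().sum()))   # 0 missing values remain after imputation
```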
Eliminating duplicates is a prevalent preprocessing measure that’s employed to cleanse datasets by detecting and removing identical records. The occurrence of duplicate records may be attributed to factors such as data entry errors, system glitches, or data merging processes. The presence of duplicates can skew models and yield inaccurate insights. Hence, it is imperative to recognize and eliminate duplicate records to uphold the accuracy and dependability of the dataset.
There are different methods for removing duplicates in a dataset. The most common method is to compare all the rows of the dataset to identify duplicate records. If two or more rows have the same values in all the columns, they are considered duplicates. In some cases, it may be necessary to compare only a subset of columns if certain columns are more prone to duplicates.
Another method is to use a unique identifier column to identify duplicates. A unique identifier column is a column that contains unique values for each record, such as an ID number or a combination of unique columns. By comparing the unique identifier column, it is possible to identify and remove duplicate records from the dataset.
After identifying the duplicate records, the next step is to decide which records to keep and which ones to remove. One approach is to keep the first occurrence of a duplicate record and remove all subsequent occurrences. Another approach is to keep the record with the most complete information, or the record with the most recent timestamp.
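Both strategies described above — full-row comparison and comparison on an identifier column, keeping the first occurrence — are one-liners in pandas; the records below are invented:

```python
import pandas as pd

df = pd.DataFrame({
    "id":   [1, 2, 2, 3],
    "text": ["good product", "fast shipping", "fast shipping", "poor quality"],
})

# Full-row comparison: rows identical in every column count as duplicates;
# keep="first" retains the first occurrence.
deduped = df.drop_duplicates(keep="first")

# Comparison on a unique-identifier column only.
deduped_by_id = df.drop_duplicates(subset=["id"], keep="first")

print(len(deduped))  # 3 rows remain after removing the repeated record
```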
It’s crucial to recognize that the removal of duplicates might lead to a reduction in dataset size, potentially affecting the performance of ML models. Consequently, assessing the impact of duplicate removal on both the dataset and the ML model is essential. In some cases, it may be necessary to keep duplicate records if they contain important information that cannot be obtained from other records.
Standardizing and transforming data is a critical step in preparing data for ML tasks. This process involves scaling and normalizing the numerical features of the dataset to make them easier to interpret and compare. The main objective of standardizing and transforming data is to enhance the accuracy and performance of an ML model by mitigating the influence of features with diverse scales and ranges. A widely used method for standardizing data is referred to as “standardization” or “Z-score normalization.” This technique involves transforming each feature such that it has a mean of zero and a standard deviation of one. The formula for standardization is shown in the following equation:

x' = (x − mean(x)) / std(x)
Here, x represents the feature, mean(x) denotes the mean of the feature, std(x) indicates the standard deviation of the feature, and x’ represents the new value assigned to the feature. By standardizing the data in this way, the range of each feature is adjusted to be centered around zero, which makes it easier to compare features and prevents features with large values from dominating the analysis.
Another technique for transforming data is “min-max scaling.” This method rescales the data to a consistent range of values, commonly ranging between 0 and 1. The formula for min-max scaling is shown here:

x' = (x − min(x)) / (max(x) − min(x))
In this equation, x represents the feature, min(x) signifies the minimum value of the feature, and max(x) denotes the maximum value of the feature. Min-max scaling proves beneficial when the precise distribution of the data is not crucial, but there is a need to standardize the data for meaningful comparisons across different features.
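Both transformations are one line of NumPy each (sklearn’s StandardScaler and MinMaxScaler wrap the same arithmetic); the feature values below are invented:

```python
import numpy as np

x = np.array([10.0, 20.0, 30.0, 40.0, 50.0])  # an invented numerical feature

# Z-score normalization: subtract the mean, divide by the standard deviation.
z = (x - x.mean()) / x.std()

# Min-max scaling: rescale to the [0, 1] range.
mm = (x - x.min()) / (x.max() - x.min())

print(np.round(z, 3))  # centered around zero with unit spread
print(mm)              # [0.   0.25 0.5  0.75 1.  ]
```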
Transforming data can also involve changing the distribution of the data. A frequently applied transformation is the log transformation, which is employed to alleviate the influence of outliers and skewness within the data. This transformation involves taking the logarithm of the feature values, which can help to normalize the distribution and reduce the influence of extreme values.
Overall, standardizing and transforming data constitute a pivotal stage in the data preprocessing workflow for ML endeavors. Through scaling and normalizing features, we can enhance the accuracy and performance of the ML model, rendering the data more interpretable and conducive to meaningful comparisons.
Outliers are data points that markedly deviate from the rest of the observations in a dataset. Their occurrence may stem from factors such as measurement errors, data corruption, or authentic extreme values. The presence of outliers can wield a substantial influence on the outcomes of ML models, introducing distortion to the data and disrupting the relationships between variables. Therefore, handling outliers is an important step in preprocessing data for ML.
There are several methods for handling outliers:

- Removal: delete the outlying observations when they are clearly errors
- Transformation: apply transformations such as the logarithm to reduce their influence
- Capping (winsorization): clip values beyond a chosen threshold to that threshold
- Robust methods: use models and statistics that are less sensitive to extreme values
It’s crucial to emphasize that selecting an outlier-handling method should be tailored to the unique characteristics of the data and the specific problem at hand. Generally, employing a combination of methods is advisable to address outliers comprehensively, and assessing the impact of each method on the results is essential. Moreover, documenting the steps taken to manage outliers is important for reproducibility and to provide clarity on the decision-making process.
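A common concrete detection rule — one option among several — is the interquartile-range (IQR) fence; here is a NumPy sketch on invented data:

```python
import numpy as np

data = np.array([12, 13, 12, 14, 15, 13, 12, 95])  # 95 is an obvious outlier

# Tukey's fences: flag points beyond 1.5 * IQR from the quartiles.
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = data[(data < lower) | (data > upper)]
cleaned = data[(data >= lower) & (data <= upper)]

print(outliers)  # [95]
print(cleaned)   # the remaining, plausible observations
```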
Rectifying errors during preprocessing stands as a vital stage in readying data for ML. Errors may manifest due to diverse reasons such as data entry blunders, measurement discrepancies, sensor inaccuracies, or transmission glitches. Correcting errors in data holds paramount significance in guaranteeing that ML models are trained on dependable and precise data, consequently enhancing the accuracy and reliability of predictions.
Several techniques exist to rectify errors in data. Here are some widely utilized methods:

- Manual correction: inspect and fix suspicious records by hand
- Rule-based validation: apply domain rules (for example, valid ranges or formats) to flag and correct invalid entries
- Cross-referencing: check values against a trusted external source
- Imputation: replace values identified as erroneous, using the same techniques used for missing data
Choosing a technique hinges on factors such as the nature of the data, the dataset’s size, and the resources at your disposal.
Feature selection involves choosing the most pertinent features from a dataset for constructing an ML model. The objective is to decrease the number of features without substantially compromising the model’s accuracy, resulting in enhanced performance, quicker training, and a more straightforward interpretation of the model.
Several approaches to feature selection exist. Let’s take a look.
These techniques employ statistical methods to rank features according to their correlation with the target variable. Common methods encompass chi-squared, mutual information, and correlation coefficients. Features are subsequently chosen based on a predefined threshold.
The chi-squared test is a widely employed statistical method in ML for feature selection that’s particularly effective for categorical variables. This test gauges the dependence between two random variables, providing a P-value that signifies the likelihood of obtaining a result as extreme as or more extreme than the actual observations.
In hypothesis testing, the chi-squared test assesses whether the collected data aligns with the expected data. A small chi-squared test statistic indicates a robust match, while a large statistic implies a weak match. A P-value less than or equal to 0.05 leads to the rejection of the null hypothesis, considering it highly improbable. Conversely, a P-value greater than 0.05 results in accepting or “failing to reject” the null hypothesis. When the P-value hovers around 0.05, further scrutiny of the hypothesis is warranted.
In feature selection, the chi-squared test evaluates the relationship between each feature and the target variable in the dataset. It determines significance based on whether a statistically significant difference exists between the observed and expected frequencies of the feature, assuming independence between the feature and target. Features with a high chi-squared score exhibit a stronger dependence on the target variable, making them more informative for classification or regression tasks. The formula for calculating the chi-squared is presented in the following equation:

χ² = Σᵢ (Oᵢ − Eᵢ)² / Eᵢ
In this equation, Oᵢ represents the observed value and Eᵢ represents the expected value. The computation involves finding the difference between the observed frequency and the expected frequency, squaring the result, and then dividing by the expected frequency. The summation of these values across all categories of the feature yields the overall chi-squared statistic for that feature.
The degrees of freedom for the test depend on the number of categories in the feature and the number of categories in the target variable: (number of feature categories − 1) × (number of target categories − 1).
An exemplary application of chi-squared feature selection lies in text classification, particularly in scenarios where the presence or absence of specific words in a document serves as features. The chi-squared test helps identify words strongly associated with a particular class or category of documents, subsequently enabling their use as features in a ML model. In categorical data, especially where the relationship between features and the target variable is non-linear, chi-squared proves to be a valuable method for feature selection. However, its suitability diminishes for continuous or highly correlated features, where alternative feature selection methods may be more fitting.
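To ground the formula, here is the chi-squared statistic computed by hand with NumPy for an invented 2×2 word-presence/class contingency table (scipy.stats.chi2_contingency or sklearn’s chi2 would automate this):

```python
import numpy as np

# Rows: word present / word absent; columns: class A / class B (invented counts).
observed = np.array([[10, 20],
                     [20, 10]])

# Expected counts under independence: (row total * column total) / grand total.
row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
expected = row_totals @ col_totals / observed.sum()

# Chi-squared statistic: sum of (O - E)^2 / E over all cells.
chi2 = ((observed - expected) ** 2 / expected).sum()

print(expected)        # every cell expects 15 under independence
print(round(chi2, 3))  # 6.667: the word is clearly associated with the class
```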
Mutual information acts as a metric to gauge the interdependence of two random variables. In the context of feature selection, it quantifies the information a feature provides about the target variable. The core methodology entails calculating the mutual information between each feature and the target variable, ultimately selecting features with the highest mutual information scores.
Mathematically, the mutual information between two discrete random variables, X and Y, can be defined as follows:

I(X; Y) = Σₓ Σᵧ p(x, y) log( p(x, y) / (p(x) p(y)) )
In the given equation, p(x, y) represents the joint probability mass function of X and Y, while p(x) and p(y) denote the marginal probability mass functions of X and Y, respectively.
In the context of feature selection, mutual information calculation involves treating the feature as X and the target variable as Y. By computing the mutual information score for each feature, we can then select features with the highest scores.
To estimate the probability mass functions needed for calculating mutual information, histogram-based methods can be employed. This involves dividing the range of each variable into a fixed number of bins and estimating the probability mass functions based on the frequencies of observations in each bin. Alternatively, kernel density estimation can be utilized to estimate the probability density functions, and mutual information can then be computed based on the estimated densities.
In practical applications, mutual information is often employed alongside other feature selection methods, such as chi-squared or correlation-based methods, to enhance the overall performance of the feature selection process.
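The definition can be evaluated directly on a small joint distribution; below, an invented table where two binary variables always agree, so the mutual information equals log(2) nats:

```python
import numpy as np

# Joint probability table p(x, y): X and Y always take the same value.
p_xy = np.array([[0.5, 0.0],
                 [0.0, 0.5]])

p_x = p_xy.sum(axis=1)  # marginal distribution of X
p_y = p_xy.sum(axis=0)  # marginal distribution of Y

mi = 0.0
for i in range(2):
    for j in range(2):
        if p_xy[i, j] > 0:  # 0 * log(0) is treated as 0 by convention
            mi += p_xy[i, j] * np.log(p_xy[i, j] / (p_x[i] * p_y[j]))

print(round(mi, 4))  # 0.6931, i.e., log(2): knowing X fully determines Y
```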
Correlation coefficients serve as indicators of the strength and direction of the linear relationship between two variables. In the realm of feature selection, these coefficients prove useful in identifying features highly correlated with the target variable, thus serving as potentially valuable predictors.
The prevalent correlation coefficient employed for feature selection is the Pearson correlation coefficient, also referred to as Pearson’s r. Pearson’s r measures the linear relationship between two continuous variables, ranging from -1 (indicating a perfect negative correlation) to 1 (indicating a perfect positive correlation), with 0 denoting no correlation. Its calculation involves dividing the covariance between the two variables by the product of their standard deviations, as depicted in the following equation:

r = cov(X, Y) / (std(X) × std(Y))
In the given equation, X and Y represent the two variables of interest, cov() denotes the covariance function, and std() represents the standard deviation function.
Utilizing Pearson’s r for feature selection involves computing the correlation between each feature and the target variable. Features with the highest absolute correlation coefficients are then selected. A high absolute correlation coefficient signifies a strong correlation with the target variable, whether positive or negative. The interpretation of Pearson correlation values and their degree of correlation is outlined in Table 3.1:
Pearson Correlation Value | Degree of Correlation
±1 | Perfect
±0.50 to ±1 | High degree
±0.30 to ±0.49 | Moderate degree
Below ±0.29 | Low degree
0 | No correlation

Table 3.1 – Pearson correlation values and their degree of correlation
It’s worth noting that Pearson’s r is only appropriate for identifying linear relationships between variables. If the relationship is nonlinear, or if one or both of the variables are categorical, other correlation coefficients such as Spearman’s or Kendall’s may be more appropriate. Additionally, it is important to be cautious when interpreting correlation coefficients as a high correlation does not necessarily imply causation.
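Computing Pearson’s r takes one NumPy call; the feature/target values below are invented so that the relationship is almost perfectly linear:

```python
import numpy as np

feature = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
target = np.array([2.1, 3.9, 6.2, 8.0, 9.8])  # roughly 2 * feature

# np.corrcoef returns the correlation matrix; the off-diagonal entry is r.
r = np.corrcoef(feature, target)[0, 1]

print(round(r, 4))  # very close to 1: a high degree of positive correlation
```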
These techniques delve into subsets of features through iterative model training and testing. Widely known methods encompass forward selection, backward elimination, and recursive feature elimination. While computationally demanding, these methods have the potential to significantly enhance model accuracy.
A concrete illustration of a wrapper method is recursive feature elimination (RFE). Functioning as a backward elimination approach, RFE systematically removes the least important feature until a predetermined number of features remains. During each iteration, a machine learning model is trained on the existing features, and the least important feature is pruned based on its feature importance score. This sequential process persists until the specified number of features is attained. The feature importance score can be extracted from diverse methods, including coefficient values from linear models or feature importance scores derived from decision trees. RFE is a computationally expensive method, but it can be useful when the number of features is very large and there is a need to reduce the feature space. An alternative approach is to have feature selection during the training process, something that’s done via embedding methods.
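sklearn provides this directly (sklearn.feature_selection.RFE), but the mechanism fits in a few lines of NumPy; this sketch ranks features by the absolute value of their least-squares coefficients on synthetic data where only features 0 and 2 matter:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: five features, but the target depends only on features 0 and 2.
X = rng.normal(size=(100, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 2]

def rfe_least_squares(X, y, n_keep):
    """Backward-elimination sketch: refit ordinary least squares each round and
    drop the feature with the smallest absolute coefficient."""
    remaining = list(range(X.shape[1]))
    while len(remaining) > n_keep:
        coefs, *_ = np.linalg.lstsq(X[:, remaining], y, rcond=None)
        remaining.pop(int(np.argmin(np.abs(coefs))))
    return remaining

selected = rfe_least_squares(X, y, n_keep=2)
print(selected)  # the two truly informative features survive
```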
These methods select features during the training process of the model. Popular methods include LASSO and ridge regression, decision trees, and random forests.
LASSO, an acronym for Least Absolute Shrinkage and Selection Operator, serves as a linear regression technique that’s commonly employed for feature selection in machine learning. Its mechanism involves introducing a penalty term to the standard regression loss function. This penalty encourages the model to reduce the coefficients of less important features to zero, effectively eliminating them from the model.
The LASSO method proves especially valuable when grappling with high-dimensional data, where the number of features far exceeds the number of samples. In such scenarios, discerning the most crucial features for predicting the target variable can be challenging. LASSO comes to the fore by automatically identifying the most relevant features while simultaneously shrinking the coefficients of others.
The LASSO method works by finding the solution for the following optimization problem, which is a minimization problem:

minimize over w: ||y − Xw||₂² + λ||w||₁
In the given equation, vector y represents the target variable, X denotes the feature matrix, w signifies the vector of regression coefficients, λ is a hyperparameter dictating the intensity of the penalty term, and ||w||₁ stands for the L1 norm of the coefficients (that is, the sum of their absolute values).
The inclusion of the penalty term in the objective function prompts the model to precisely zero out certain coefficients, essentially eliminating the associated features from the model. The degree of penalty strength is governed by the hyperparameter λ, which can be fine-tuned through the use of cross-validation.
LASSO has several advantages over other feature selection methods, such as its ability to handle correlated features and its ability to perform feature selection and regression simultaneously. However, LASSO has some limitations, such as its tendency to select only one feature from a group of correlated features, and its performance may deteriorate if the number of features is much larger than the number of samples.
Consider the application of LASSO for feature selection in predicting house prices. Imagine a dataset encompassing details about houses – such as the number of bedrooms, lot size, construction year, and so on – alongside their respective sale prices. Employing LASSO, we can pinpoint the most crucial features to predict the sale price while concurrently fitting a linear regression model to the dataset. The outcome is a model that’s ready to forecast the sale price of a new house based on its features.
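In practice one would call sklearn.linear_model.Lasso; to show where the exact zeros come from, here is a minimal coordinate-descent sketch on synthetic data (λ is chosen by hand here, not tuned):

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic "housing" data: five features, only the first two drive the price.
X = rng.normal(size=(200, 5))
y = 4.0 * X[:, 0] + 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

def soft_threshold(rho, lam):
    # The operator that sets small coefficients exactly to zero.
    return np.sign(rho) * max(abs(rho) - lam, 0.0)

def lasso_coordinate_descent(X, y, lam, n_iter=100):
    """Minimal LASSO solver sketch: cycle through coordinates, applying soft-thresholding."""
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        for j in range(X.shape[1]):
            residual = y - X @ w + X[:, j] * w[j]  # leave coordinate j out
            rho = X[:, j] @ residual
            w[j] = soft_threshold(rho, lam) / (X[:, j] @ X[:, j])
    return w

w = lasso_coordinate_descent(X, y, lam=20.0)
print(np.round(w, 2))  # the three irrelevant coefficients land exactly at 0
```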
Ridge regression, a linear regression method applicable to feature selection, closely resembles ordinary least squares regression but introduces a penalty term to the cost function to counter overfitting.
In ridge regression, the cost function undergoes modification with the inclusion of a penalty term directly proportional to the square of the coefficients’ magnitude. This penalty term is regulated by a hyperparameter, often denoted as λ (or alpha), dictating the regularization strength. When λ is set to zero, ridge regression reverts to ordinary least squares regression.
The penalty term’s impact manifests in shrinking the coefficients’ magnitude toward zero. This proves beneficial in mitigating overfitting, discouraging the model from excessively relying on any single feature. In effect, the penalty term acts as a form of feature selection by reducing the importance of less relevant features.
The equation for the ridge regression loss function is as follows:

L(w) = Σᵢ (yᵢ − ŷᵢ)² + λ Σⱼ wⱼ²
Here, we have the following:

- yᵢ is the true value of the target for the i-th sample
- ŷᵢ is the value predicted by the model
- wⱼ is the j-th regression coefficient
- λ is the regularization parameter that controls the strength of the penalty
The first term in the loss function measures the squared error between the predicted values and the true values. The second term is the penalty term that shrinks the coefficients toward zero. The ridge regression algorithm finds the values of the regression coefficients that minimize this loss function. By tuning the regularization parameter, λ, we can control the bias-variance trade-off of the model, with higher λ values leading to more regularization and less overfitting.
Ridge regression can be used for feature selection by examining the magnitudes of the coefficients produced by the model. Features with coefficients that are close to zero are considered less important and can be dropped from the model. The value of λ can be tuned using cross-validation to find the optimal balance between model complexity and accuracy.
One of the main advantages of ridge regression is its ability to handle multicollinearity, which occurs when there are strong correlations between the independent variables. In such cases, ordinary least squares regression can produce unstable and unreliable coefficient estimates, but ridge regression can help stabilize the estimates and improve the overall performance of the model.
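Ridge has a convenient closed form, w = (XᵀX + λI)⁻¹ Xᵀy, sketched below on synthetic data; λ = 0 recovers ordinary least squares, and a large λ visibly shrinks the coefficients:

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic regression data with known coefficients.
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 3.0]) + rng.normal(scale=0.1, size=100)

def ridge_fit(X, y, lam):
    """Closed-form ridge solution: solve (X^T X + lam * I) w = X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

w_ols = ridge_fit(X, y, lam=0.0)      # lam = 0: ordinary least squares
w_ridge = ridge_fit(X, y, lam=100.0)  # strong penalty shrinks the weights

print(np.round(w_ols, 2))  # close to the true coefficients [1, -2, 3]
print(np.linalg.norm(w_ridge) < np.linalg.norm(w_ols))  # True
```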
Ridge regression and LASSO are both regularization techniques that are used in linear regression to prevent overfitting of the model by penalizing the model’s coefficients. While both methods seek to prevent overfitting, they differ in their approach to how the coefficients are penalized.
Ridge regression adds a penalty term to the sum of squared errors (SSE) that is proportional to the square of the magnitude of the coefficients. The penalty term is controlled by a regularization parameter (λ), which determines the amount of shrinkage applied to the coefficients. This penalty term shrinks the values of the coefficients toward zero but does not set them exactly to zero. Therefore, ridge regression can be used to reduce the impact of irrelevant features in a model, but it will not eliminate them completely.
On the other hand, LASSO also adds a penalty term to the SSE, but the penalty term is proportional to the absolute value of the coefficients. Like ridge, LASSO also has a regularization parameter (λ) that determines the amount of shrinkage applied to the coefficients. However, LASSO has a unique property of setting some of the coefficients exactly to zero when the regularization parameter is sufficiently high. Therefore, LASSO can be used for feature selection as it can eliminate irrelevant features and set their corresponding coefficients to zero.
In general, if the dataset has many features and a small number of them are expected to be important, LASSO regression is a better choice as it will set the coefficients of irrelevant features to zero, leading to a simpler and more interpretable model. On the other hand, if most of the features in the dataset are expected to be relevant, ridge regression is a better choice as it will shrink the coefficients toward zero but not set them exactly to zero, preserving all the features in the model.
然而,值得注意的是,ridge 和 LASSO 之间的最佳选择取决于具体问题和数据集,通常建议尝试两者并使用交叉验证来比较它们的性能离子技术。
However, it is important to note that the optimal choice between ridge and LASSO depends on the specific problem and dataset, and it is often recommended to try both and compare their performance using cross-validation techniques.
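The contrast between the two penalties can be sketched with scikit-learn, where the `alpha` parameter plays the role of λ. The data here is a synthetic assumption: ten features, of which only the first two truly matter.

```python
# Ridge vs. LASSO on synthetic data where only 2 of 10 features matter.
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.RandomState(42)
X = rng.randn(200, 10)
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.randn(200)  # only features 0 and 1 are relevant

ridge = Ridge(alpha=10.0).fit(X, y)   # shrinks coefficients toward zero
lasso = Lasso(alpha=0.1).fit(X, y)    # can set coefficients exactly to zero

n_zero_ridge = int(np.sum(ridge.coef_ == 0))
n_zero_lasso = int(np.sum(lasso.coef_ == 0))
print("exact zeros - ridge:", n_zero_ridge, "| lasso:", n_zero_lasso)
```

On data like this, ridge keeps all ten coefficients nonzero while LASSO zeroes out most of the irrelevant ones, which is exactly the feature-selection behavior described above.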
These methods transform the features into a lower-dimensional space while retaining as much information as possible. Popular methods include PCA, linear discriminant analysis (LDA), and t-SNE.
PCA is a widely used technique in machine learning for reducing the dimensionality of large datasets while retaining most of the important information. The basic idea of PCA is to transform a set of correlated variables into a set of uncorrelated variables known as principal components.
The goal of PCA is to identify the directions of maximum variance in the data and project the data in these directions, reducing the dimensionality of the data. The principal components are sorted in order of the amount of variance they explain, with the first principal component explaining the most variance in the data.
The PCA algorithm involves the following steps:
PCA can be used for feature selection by selecting the top k principal components that explain the most variance in the data. This can be useful for reducing the dimensionality of high-dimensional datasets and improving the performance of machine learning models. However, it’s important to note that PCA may not always lead to improved performance, especially if the data is already low-dimensional or if the features are not highly correlated. It’s also important to consider the interpretability of the selected principal components as they may not always correspond to meaningful features in the data.
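As a minimal sketch of selecting the top k components with scikit-learn (using the bundled iris data as a stand-in dataset):

```python
# Project the 4-dimensional iris data onto its top 2 principal components.
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data                 # shape (150, 4)
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)     # shape (150, 2)

# Components are sorted by the variance they explain, first component first.
print(pca.explained_variance_ratio_)
```

The `explained_variance_ratio_` attribute shows how much of the total variance each retained component accounts for, which is a practical way to choose k.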
LDA is a dimensionality reduction technique that’s used for feature selection in machine learning. It is often used in classification tasks to reduce the number of features by transforming them into a lower-dimensional space while retaining as much class-discriminatory information as possible.
In LDA, the goal is to find a linear combination of the original features that maximizes the separation between classes. The input to LDA is a dataset of labeled examples, where each example is a feature vector with a corresponding class label. The output of LDA is a set of linear combinations of the original features, which can be used as new features in a machine learning model.
To perform LDA, the first step is to compute the mean and covariance matrix of each class. The overall mean and covariance matrix are then calculated from the class means and covariance matrices. The goal is to project the data onto a lower-dimensional space while still retaining the class information. This is achieved by finding the eigenvectors and eigenvalues of the covariance matrix, sorting them in descending order of the eigenvalues, and selecting the top k eigenvectors that correspond to the k largest eigenvalues. The selected eigenvectors form the basis for the new feature space.
The LDA algorithm can be summarized in the following steps:
Here, S_W is the within-class scatter matrix and S_B is the between-class scatter matrix.
7. Select the top k eigenvectors with the highest eigenvalues as the new feature space.
LDA is particularly useful when the number of features is large and the number of examples is small. It can be used in a variety of applications, including image recognition, speech recognition, and NLP. However, it assumes that the classes are normally distributed and that the class covariance matrices are equal, which may not always be the case in practice.
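A minimal scikit-learn sketch of LDA as a supervised reducer (iris has 3 classes, so at most 2 discriminant axes are available):

```python
# LDA uses the class labels, unlike PCA, to find class-discriminating axes.
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)
lda = LinearDiscriminantAnalysis(n_components=2)
X_lda = lda.fit_transform(X, y)      # projection onto 2 discriminant axes

print(X_lda.shape)
```

Note that `fit_transform` takes the labels `y`; the projection maximizes class separation rather than total variance.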
t-SNE is a dimensionality reduction technique that’s used for visualizing high-dimensional data in a low-dimensional space, often used for feature selection. It was developed by Laurens van der Maaten and Geoffrey Hinton in 2008.
The basic idea behind t-SNE is to preserve the pairwise similarities of data points in a low-dimensional space, as opposed to preserving the distances between them. In other words, it tries to retain the local structure of the data while discarding the global structure. This can be useful in situations where the high-dimensional data is difficult to visualize, but there may be meaningful patterns and relationships among the data points.
t-SNE starts by calculating the pairwise similarity between each pair of data points in the high-dimensional space. The similarity is usually measured using a Gaussian kernel, which gives higher weights to nearby points and lower weights to distant points. The similarity matrix is then converted into a probability distribution using a softmax function. This distribution is used to create a low-dimensional space, typically 2D or 3D.
In the low-dimensional space, t-SNE again calculates the pairwise similarities between each pair of data points, but this time using a Student's t-distribution instead of a Gaussian distribution. The t-distribution has heavier tails than the Gaussian distribution, which helps to better preserve the local structure of the data. t-SNE then adjusts the position of the points in the low-dimensional space to minimize the difference between the pairwise similarities in the high-dimensional space and the pairwise similarities in the low-dimensional space.
t-SNE is a powerful technique for visualizing high-dimensional data by reducing it to a low-dimensional space. However, it is not typically used for feature selection as its primary purpose is to create visualizations of complex datasets.
Instead, t-SNE can be used to help identify clusters of data points that share similar features, which may be useful in identifying groups of features that are important for a particular task. For example, suppose you have a dataset of customer demographics and purchase history, and you want to identify groups of customers that are similar based on their purchasing behavior. You could use t-SNE to reduce the high-dimensional feature space to two dimensions, and then plot the resulting data points on a scatter plot. By examining the plot, you might be able to identify clusters of customers with similar purchasing behavior, which could then inform your feature selection process. Here’s a sample t-SNE for the MNIST dataset:
Figure 3.1 – t-SNE on the MNIST dataset
It’s worth noting that t-SNE is primarily a visualization tool and should not be used as the sole method for feature selection. Instead, it can be used in conjunction with other techniques, such as LDA or PCA, to gain a more complete understanding of the underlying structure of your data.
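A minimal sketch of the visualization workflow with scikit-learn, using the bundled digits data as a stand-in for a high-dimensional dataset (the subset size and perplexity are illustrative choices):

```python
# Embed a subset of the 64-dimensional digits data into 2D for plotting.
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)
X_small = X[:200]                    # t-SNE is expensive; keep the sample small
emb = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X_small)

print(emb.shape)                     # 2D coordinates ready for a scatter plot
```

The resulting 2D coordinates would then be plotted (e.g., colored by `y[:200]`) to look for clusters, as in the MNIST figure above.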
The choice of feature selection method depends on the nature of the data, the size of the dataset, the complexity of the model, and the computational resources available. It is important to carefully evaluate the performance of the model after feature selection to ensure that important information has not been lost. Another important process is feature engineering, which involves transforming or selecting features for machine learning models.
Feature engineering is the process of selecting, transforming, and extracting features from raw data to improve the performance of machine learning models. Features are the individual measurable properties or characteristics of the data that can be used to make predictions or classifications.
One common technique in feature engineering is feature selection, which involves selecting a subset of relevant features from the original dataset to improve the model’s accuracy and reduce its complexity. This can be done through statistical methods such as correlation analysis or feature importance ranking using decision trees or random forests.
Another technique in feature engineering is feature extraction, which involves transforming the raw data into a new set of features that may be more useful for the model. The primary distinction between feature selection and feature engineering lies in their approaches: while feature selection retains a subset of the original features without modifying the selected features, feature engineering algorithms reconfigure and transform the data into a new feature space. Feature engineering can be done through techniques such as dimensionality reduction, PCA, or t-SNE. Feature selection and extraction were explained in detail in the previous subsection (3-1-3).
Feature scaling is another important technique in feature engineering that involves scaling the values of features to the same range, typically between 0 and 1 or -1 and 1. This is done to prevent certain features from dominating others in the model and to ensure that the algorithm can converge quickly during training. When the features in the dataset have different scales, this can lead to issues when using certain machine learning algorithms that are sensitive to the relative magnitudes of the features. Feature scaling can help to address this problem by ensuring that all features are on a similar scale. Common methods for feature scaling include min-max scaling, Z-score scaling, and scaling by the maximum absolute value.
There are several common methods for feature scaling:
Min-max scaling: x' = (x - min(x)) / (max(x) - min(x)). Here, x is the original feature value, min(x) is the minimum value of the feature, and max(x) is the maximum value of the feature.
Z-score scaling: x' = (x - mean(x)) / std(x). Here, x is the original feature value, mean(x) is the mean of the feature, and std(x) is the standard deviation of the feature.
Robust scaling: x' = (x - median(x)) / (Q3(x) - Q1(x)). Here, x is the original feature value, median(x) is the median of the feature, Q1(x) is the first quartile of the feature, and Q3(x) is the third quartile of the feature.
Here, x is the original feature value.
Here, x is the original feature value, and λ is the power parameter that is estimated using maximum likelihood.
These are some of the most common methods for feature scaling in machine learning. The choice of method depends on the distribution of the data, the machine learning algorithm being used, and the specific requirements of the problem.
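The scaling methods above can be sketched with scikit-learn's built-in transformers. The toy column is an assumption and deliberately contains an outlier (100.0) to show why a robust, quartile-based scaler exists:

```python
# Four common scalers applied to the same single-feature column.
import numpy as np
from sklearn.preprocessing import (MaxAbsScaler, MinMaxScaler, RobustScaler,
                                   StandardScaler)

X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

X_minmax = MinMaxScaler().fit_transform(X)    # (x - min) / (max - min)
X_zscore = StandardScaler().fit_transform(X)  # (x - mean) / std
X_robust = RobustScaler().fit_transform(X)    # (x - median) / IQR
X_maxabs = MaxAbsScaler().fit_transform(X)    # x / max(|x|)

print(X_minmax.ravel())
```

Min-max maps the column into [0, 1] and max-abs into [-1, 1], while the robust scaler is the least distorted by the outlier.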
One final technique in feature engineering is feature construction, which involves creating new features by combining or transforming existing ones. This can be done through techniques such as polynomial expansion, logarithmic transformation, or interaction terms.
Polynomial expansion is a feature construction technique that involves creating new features by taking polynomial combinations of existing features. This technique is commonly used in machine learning to model nonlinear relationships between features and the target variable.
The idea behind polynomial expansion is to create new features by raising the existing features to different powers and taking their products. For example, suppose we have a single feature, x. We can create new features by taking the square of x (x²). We can also create higher-order polynomial features by taking x to even higher powers, such as x³, x⁴, and so on. In general, we can create polynomial features of degree d by taking all possible combinations of products and powers of the original features up to degree d.
In addition to creating polynomial features from a single feature, we can also create polynomial features from multiple features. For example, suppose we have two features, x₁ and x₂. We can create new polynomial features by taking their product (x₁x₂) and raising them to different powers (x₁²x₂, x₁x₂², and so on). Again, we can create polynomial features of any degree by taking all possible combinations of products and powers of the original features.
One important consideration when using polynomial expansion is that it can quickly lead to a large number of features, especially for high degrees of polynomials. This can make the resulting model more complex and harder to interpret, and can also lead to overfitting if the number of features is not properly controlled. To address this issue, it is common to use regularization techniques or feature selection methods to select a subset of the most informative polynomial features.
Overall, polynomial expansion is a powerful feature construction technique that can help capture complex nonlinear relationships between features and the target variable. However, it should be used with caution and with appropriate regularization or feature selection to avoid overfitting and maintain model interpretability.
For example, in a regression problem, you might have a dataset with a single feature, say x, and you want to fit a model that can capture the relationship between x and the target variable, y. However, the relationship between x and y may not be linear, and a simple linear model may not be sufficient. In this case, polynomial expansion can be used to create additional features that capture the non-linear relationship between x and y.
To illustrate, let’s say you have a dataset with a single feature, x, and a target variable, y, and you want to fit a polynomial regression model. The goal is to find a function, f(x), that minimizes the difference between the predicted and actual values of y.
Polynomial expansion can be used to create additional features based on x, such as x², x³, and so on. This can be done using libraries such as scikit-learn, which has a PolynomialFeatures function that can automatically generate polynomial features of a specified degree.
By adding these polynomial features, the model becomes more expressive and can capture the non-linear relationship between x and y. However, it’s important to be careful not to overfit the data as adding too many polynomial features can lead to a model that is overly complex and performs poorly on new, unseen data.
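The PolynomialFeatures workflow above can be sketched as follows; the cubic target is an assumption chosen so that the degree-3 expansion fits it well:

```python
# Expand a single feature x into [1, x, x^2, x^3], then fit a linear model.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

x = np.linspace(-3, 3, 50).reshape(-1, 1)
y = 0.5 * x.ravel() ** 3 - x.ravel() + 1.0    # a non-linear (cubic) target

X_poly = PolynomialFeatures(degree=3).fit_transform(x)  # columns: 1, x, x^2, x^3
model = LinearRegression().fit(X_poly, y)

print(X_poly.shape, round(model.score(X_poly, y), 4))
```

A plain linear fit on x alone cannot represent the cubic term, but the same linear model on the expanded features captures it.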
Logarithmic transformation is a common feature engineering technique that’s used in data preprocessing. The goal of logarithmic transformation is to make data less skewed and more symmetric by applying a logarithmic function to the features. This technique can be particularly useful for features that are skewed, such as those with a long tail of high values.
The logarithmic transformation is defined as an equation taking the natural logarithm of the data:

y = ln(x)
Here, y is the transformed data and x is the original data. The logarithmic function maps the original data to a new space, where the relationship between the values is preserved but the scale is compressed. The logarithmic transformation is particularly useful for features with large ranges or that are distributed exponentially, such as the prices of products or the incomes of individuals.
One of the benefits of the logarithmic transformation is that it can help normalize data and make it more suitable for certain machine learning algorithms that assume normally distributed data. Additionally, logarithmic transformation can reduce the impact of outliers on the data, which can help improve the performance of some models.
It’s important to note that the logarithmic transformation is not appropriate for all types of data. For example, if the data includes zero or negative values, the logarithmic transformation cannot be applied directly. In these cases, a modified logarithmic transformation, such as adding a constant before taking the logarithm, may be used. Overall, logarithmic transformation is a useful technique for feature engineering that can help improve the performance of machine learning models, especially when dealing with skewed or exponentially distributed data.
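The transform, and the zero-tolerant variant mentioned above, can be sketched on synthetic skewed data (the lognormal "incomes" are an assumption):

```python
# Log-transform a right-skewed sample; np.log1p computes log(1 + x),
# a common variant that also tolerates zeros.
import numpy as np

rng = np.random.RandomState(0)
incomes = rng.lognormal(mean=10.0, sigma=1.0, size=1000)  # skewed: mean >> median

log_incomes = np.log(incomes)       # y = ln(x); requires x > 0
log1p_incomes = np.log1p(incomes)   # safe when zeros are present

print(round(np.mean(incomes)), round(np.median(incomes)))
print(round(np.mean(log_incomes), 2), round(np.median(log_incomes), 2))
```

Before the transform the mean is far above the median; afterward they nearly coincide, i.e., the skew is largely gone.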
In summary, feature engineering is a critical step in the machine learning pipeline as it can significantly impact the performance and interpretability of the resulting models. Effective feature engineering requires domain knowledge, creativity, and an iterative process of testing and refining different techniques until the optimal set of features is identified.
In feature construction, interaction terms refer to creating new features by combining two or more existing features in a dataset through multiplication, division, or other mathematical operations. These new features capture the interaction or relationship between the original features, and they can help improve the accuracy of machine learning models.
For example, in a dataset of real estate prices, you might have features such as the number of bedrooms, the number of bathrooms, and the square footage of the property. By themselves, these features provide some information about the price of the property, but they do not capture any interaction effects between the features. However, by creating an interaction term between the number of bedrooms and the square footage, you can capture the idea that larger properties with more bedrooms tend to be more expensive than smaller ones with the same number of bedrooms.
In practice, interaction terms are created by multiplying or dividing two or more features together. For example, if we have two features, x and y, we can create an interaction term by multiplying them together: xy. We can also create interaction terms by dividing one feature by another: x/y.
When creating interaction terms, it is important to consider which features to combine and how to combine them. Here are some common techniques:
Overall, interaction terms are a powerful tool in feature construction that can help capture complex relationships between features and improve the accuracy of machine learning models. However, it is important to be careful when creating interaction terms as too many or poorly chosen terms can lead to overfitting or decreased model interpretability.
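A minimal numpy sketch of the real-estate example above (the bedroom and square-footage values are hypothetical):

```python
# Build multiplicative and division-based interaction terms from two features.
import numpy as np

bedrooms = np.array([2.0, 3.0, 4.0, 3.0])
sqft = np.array([800.0, 1200.0, 2000.0, 1500.0])

interaction = bedrooms * sqft    # bedrooms x square footage (xy)
ratio = sqft / bedrooms          # square footage per bedroom (x/y)

X = np.column_stack([bedrooms, sqft, interaction, ratio])
print(X[0])
```

The stacked matrix `X` is what a downstream model would train on: the original features plus the two derived interaction columns.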
Here, we will explain some of the most common machine learning models, as well as their advantages and disadvantages. Knowing this information will help you pick the best model for the problem and be able to improve the implemented model.
Linear regression is a type of supervised learning algorithm that’s used to model the relationship between a dependent variable and one or more independent variables. It assumes a linear relationship between the input features and the output. The goal of linear regression is to find the best-fit line that predicts the value of the dependent variable based on the independent variables.
The equation for a simple linear regression with one independent variable (also called a simple linear equation) is as follows:

y = mx + b
Here, we have the following:
- y is the dependent variable (the value being predicted)
- x is the independent variable
- m is the slope of the line
- b is the y-intercept
The goal of linear regression is to find the values of m and b that minimize the difference between the predicted values and the actual values of the dependent variable. This difference is typically measured using a cost function, such as mean squared error or mean absolute error.
Multiple linear regression is an extension of simple linear regression, where there are multiple independent variables. The equation for multiple linear regression is shown here:

y = b₀ + b₁x₁ + b₂x₂ + … + bₙxₙ
Here we have the following:
- y is the dependent variable
- x₁, x₂, …, xₙ are the independent variables
- b₀ is the intercept
- b₁, b₂, …, bₙ are the coefficients
Similar to simple linear regression, the goal of multiple linear regression is to find the values of b₀, b₁, …, bₙ that minimize the difference between the predicted values and the actual values of the dependent variable.
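A minimal scikit-learn sketch of multiple linear regression; the true intercept and coefficients (4.0, 2.0, -3.0) are assumptions baked into the synthetic data, and the fit should recover them:

```python
# Fit a multiple linear regression and recover the generating coefficients.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(1)
X = rng.randn(500, 2)
y = 4.0 + 2.0 * X[:, 0] - 3.0 * X[:, 1] + 0.1 * rng.randn(500)

model = LinearRegression().fit(X, y)  # minimizes the squared-error cost
print(round(model.intercept_, 2), np.round(model.coef_, 2))
```

With low noise, the fitted intercept and coefficients land very close to the generating values.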
The advantages of linear regression are as follows:
The disadvantages of linear regression are as follows:
Logistic regression is a popular machine learning algorithm that’s used for classification problems. Unlike linear regression, which is used for predicting continuous values, logistic regression is used for predicting discrete outcomes, typically binary outcomes (0 or 1).
The goal of logistic regression is to estimate the probability of a certain outcome based on one or more input variables. The output of logistic regression is a probability score, which can be converted into a binary class label by applying a threshold value. The threshold value can be adjusted to balance between precision and recall based on the specific requirements of the problem.
The logistic regression model assumes that the relationship between the input variables and the output variable is linear in the logit (log odds) space. The logit function is defined as follows:

logit(p) = log(p / (1 - p))
Here, p is the probability of the positive outcome (that is, the probability of the event occurring).
The logistic regression model can be represented mathematically as follows:

logit(p) = b₀ + b₁x₁ + b₂x₂ + … + bₙxₙ
Here, b₀, b₁, …, bₙ are the coefficients of the model, x₁, x₂, …, xₙ are the input variables, and logit(p) is the logit function of the probability of a positive outcome.
The logistic regression model is trained using a dataset of labeled examples, where each example consists of a set of input variables and a binary label indicating whether the positive outcome occurred or not. The coefficients of the model are estimated using maximum likelihood estimation, which seeks to find the values of the coefficients that maximize the likelihood of the observed data.
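The probability-then-threshold behavior described above can be sketched with scikit-learn (the dataset is a synthetic stand-in from `make_classification`):

```python
# Logistic regression: probabilities thresholded at 0.5 give class labels.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=4, random_state=0)
clf = LogisticRegression().fit(X, y)   # coefficients fit via maximum likelihood

proba = clf.predict_proba(X[:5])[:, 1]   # P(y = 1) for the first 5 samples
labels = (proba >= 0.5).astype(int)      # apply the 0.5 threshold manually
print(labels, clf.predict(X[:5]))        # the two agree
```

Raising the threshold above 0.5 trades recall for precision; lowering it does the opposite.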
The advantages of logistic regression are as follows:
The disadvantages of logistic regression are as follows:
Decision trees are a type of supervised learning algorithm used for classification and regression analysis. A decision tree consists of a series of nodes that represent decision points, each of which has one or more branches that lead to other decision points or a final prediction.
In a classification problem, each leaf node of the tree represents a class label, while in a regression problem, each leaf node represents a numerical value. The process of building a decision tree involves choosing a sequence of attributes that best splits the data into subsets that are more homogeneous with respect to the target variable. This process is typically repeated recursively for each subset until a stopping criterion is met, such as a minimum number of instances in each subset or a maximum depth of the tree.
The equations for decision trees involve calculating the information gain (or another splitting criterion, such as Gini impurity or entropy) for each potential split at each decision point. The attribute with the highest information gain is selected as the split criterion for that node. The conceptual formula for information gain is shown here:

Information gain = Entropy(parent) - weighted average entropy of the child nodes
Here, entropy is a measure of the impurity or randomness of a system. In the context of decision trees, entropy is used to measure the impurity of a node in the tree.
The entropy of a node is calculated as follows:

Entropy = -Σ pᵢ log₂(pᵢ), summed over the classes i = 1, …, c
Here, c is the number of classes and pᵢ is the proportion of the samples that belong to class i in the node.
The entropy of a node ranges from 0 to 1, with 0 indicating a pure node (that is, all samples belong to the same class) and 1 indicating a node that is evenly split between all classes.
In a decision tree, the entropy of a node is used to determine the splitting criterion for the tree. The idea is to split the node into two or more child nodes such that the entropy of the child nodes is lower than the entropy of the parent node. The split with the lowest entropy is chosen as the best split.
Please note that the choice of the next node in the decision tree differs based on the underlying algorithm – for example, CART, ID3, or C4.5. What we explained here was CART, which uses Gini impurity and entropy to split the data.
The advantage of using entropy as a splitting criterion is that it can handle both binary and multi-class classification problems. It is also relatively computationally efficient compared to other splitting criteria. However, one disadvantage of using entropy is that it tends to create biased trees in favor of attributes with many categories.
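The entropy and information-gain calculations above can be sketched in plain numpy on a tiny hand-made node (the labels are illustrative):

```python
# Entropy of a node and the information gain of a candidate split.
import numpy as np

def entropy(labels):
    """Entropy = -sum(p_i * log2(p_i)) over the classes present in a node."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

parent = np.array([0, 0, 0, 0, 1, 1, 1, 1])  # evenly split node: entropy = 1.0
left, right = parent[:4], parent[4:]         # a perfect split: both pure

gain = entropy(parent) - (len(left) / len(parent)) * entropy(left) \
                       - (len(right) / len(parent)) * entropy(right)
print(entropy(parent), gain)
```

A perfect split takes the node from maximum impurity (1.0) to two pure children, so the information gain is the full 1.0 bit.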
Here are some of the advantages of decision trees:
Here are some of the disadvantages of decision trees:
Random forest is an ensemble learning method that’s versatile and can perform classification and regression tasks. It operates by generating multiple decision trees during training, predicting the target class for classification based on the majority of the trees, and the predicted value based on the mean prediction by trees for regression tasks. The algorithm for constructing a random forest can be summarized in the following steps:
The random forest algorithm can be expressed mathematically as follows.
Given a dataset, D, with N samples and M features, we create T decision trees {Tree_1, Tree_2, …, Tree_T} by applying the preceding steps. Each decision tree is constructed using a bootstrap sample of the data, D', with size N' (N' <= N) and a subset of the features, F', with size m (m <= M). For each split in the decision tree, we randomly select k (k < m) features from F' and choose the best feature to split the data based on an impurity measure (for example, Gini index or entropy). The decision tree is built until a stopping criterion is met (for example, the maximum depth or minimum number of samples in a leaf node).
The final prediction, ŷ, for a new sample, x, is obtained by aggregating the predictions from all decision trees.
For classification, ŷ is the class that receives the most votes from all decision trees:

ŷ = argmax_c Σ_{j=1}^{T} I(Tree_j(x) = c)
Here, Tree_j(x) is the prediction of the j-th decision tree for the sample, and I(·) is the indicator function that returns 1 if the condition is true and 0 otherwise.
For regression, ŷ is the average of the predictions from all decision trees:

ŷ = (1/T) Σ_{i=1}^{T} Tree_i(x)
Here, Tree_i(x) is the prediction of the i-th decision tree for the new sample, x.
In summary, random forest is a powerful machine learning algorithm that can handle high-dimensional and noisy datasets. It works by constructing multiple decision trees using bootstrap samples of the data and feature subsets, and then aggregating the predictions of all decision trees to make a final prediction. The algorithm is scalable, easy to use, and provides a measure of feature importance, making it a popular choice for many machine learning applications.
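A minimal scikit-learn sketch of the ensemble described above (iris data as a stand-in; 100 trees is an illustrative choice):

```python
# A random forest of 100 trees: majority vote for prediction, plus the
# built-in feature-importance measure mentioned above.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

print(rf.predict(X[:3]))          # class with the most votes across the trees
print(rf.feature_importances_)    # per-feature importances, summing to 1.0
```

Each fitted tree is accessible via `rf.estimators_`, and `feature_importances_` aggregates impurity reduction across all of them.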
The advantages of random forests are as follows:
The disadvantages of random forests are as follows:
Overall, random forest is a powerful machine learning algorithm that has many advantages, but it is important to carefully consider its limitations before applying it to a particular problem.
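As a quick illustration of these ideas, here is a minimal scikit-learn sketch; the synthetic dataset and hyperparameter values are illustrative choices, not taken from this book's code bundle:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic dataset: N = 500 samples, M = 20 features
X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42)

# n_estimators is T (the number of trees); max_features="sqrt" limits the
# number of features considered at each split, as described previously
rf = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                            random_state=42)
rf.fit(X_train, y_train)
accuracy = rf.score(X_test, y_test)
```

After fitting, `rf.feature_importances_` exposes the feature importance measure mentioned above.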
SVMs are considered robust supervised learning algorithms that can perform both classification and regression tasks. They excel in scenarios with intricate decision boundaries, surpassing the limitations of linear models. At their core, SVMs aim to identify a hyperplane within a multi-dimensional space that maximally segregates the classes. This hyperplane is positioned to maximize the distance between itself and the closest points from each class, known as support vectors. Here's how SVMs work for a binary classification problem. Given a set of training data, {(x_i, y_i)}_{i=1}^{N}, where x_i is a d-dimensional feature vector and y_i is the binary class label (+1 or -1), the goal of an SVM is to find a hyperplane that separates the two classes with the largest margin. The margin is defined as the distance between the hyperplane and the closest data points from each class:
Figure 3.2 – SVM margins
The hyperplane is defined by a weight vector, w, and a bias term, b, such that for any new data point, x, the predicted class label, y, is given by the following equation:
y = sign(w · x + b)
Here, sign is the sign function, which returns +1 if the argument is positive and -1 otherwise.
The objective function of an SVM is to minimize the classification error subject to the constraint that the margin is maximized. This can be formulated as an optimization problem:
minimize (1/2)‖w‖² subject to y_i(w · x_i + b) ≥ 1 for all i
Here, ‖w‖² is the squared Euclidean norm of the weight vector, w. The constraints ensure that all data points are correctly classified and that the margin is maximized.
Here are some of the advantages of SVMs:
Here are some of the disadvantages of SVMs:
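A minimal sketch of a linear SVM in scikit-learn; the dataset and C value are illustrative, and a scaler is included because SVMs are sensitive to feature scale:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25,
                                                    random_state=0)

# kernel="linear" searches for the maximum-margin hyperplane (w, b) directly;
# non-linear kernels such as "rbf" handle curved decision boundaries
svm = make_pipeline(StandardScaler(), SVC(kernel="linear", C=1.0))
svm.fit(X_train, y_train)
accuracy = svm.score(X_test, y_test)
```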
Neural networks and transformers are both powerful machine learning models that are used for a variety of tasks, such as image classification, NLP, and speech recognition.
Neural networks draw inspiration from the structure and functioning of the human brain. They represent a category of machine learning models that are proficient in various tasks such as classification, regression, and more. Comprising multiple layers of interconnected nodes known as neurons, these networks adeptly process and manipulate data. The output of each layer is fed into the next layer, creating a hierarchy of feature representations. The input to the first layer is the raw data, and the output of the final layer is the prediction. A simple neural network for detecting the gender of a person based on their height and weight is shown in Figure 3.3:
Figure 3.3 – Simple neural network
The operation of a single neuron in a neural network can be represented by the following equation:
y = f(Σ_i w_i x_i + b)
Here, x_i are the input values, w_i are the weights of the connections between the neurons, b is the bias term, and f is the activation function. The activation function applies a non-linear transformation to the weighted sum of the inputs and bias term.
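The single-neuron equation can be sketched directly in NumPy; the height/weight inputs and the weights below are made-up illustrative numbers, not learned values:

```python
import numpy as np

def neuron(x, w, b):
    # y = f(sum_i w_i * x_i + b), with f chosen here as the sigmoid function
    z = np.dot(w, x) + b
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([1.70, 65.0])    # height (m) and weight (kg) of one person
w = np.array([0.4, -0.01])    # illustrative connection weights
b = 0.1                       # bias term
output = neuron(x, w, b)      # sigmoid maps the weighted sum into (0, 1)
```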
Training a neural network involves adjusting the weights and biases of the neurons to minimize a loss function. This is typically done using an optimization algorithm such as stochastic gradient descent.
The advantages of neural networks include their ability to learn complex non-linear relationships between input and output data, their ability to automatically extract meaningful features from raw data, and their scalability to large datasets.
The disadvantages of neural networks include their high computational and memory requirements, their sensitivity to hyperparameter tuning, and the difficulty of interpreting their internal representations.
Transformers are a type of neural network architecture that is particularly well suited to sequential data such as text or speech. They were introduced in the context of NLP and have since been applied to a wide range of tasks.
The core component of a transformer is the self-attention mechanism, which allows the model to attend to different parts of the input sequence when computing the output. The self-attention mechanism is based on a dot product between a query vector, a set of key vectors, and a set of value vectors. The resulting attention weights are used to weight the values, which are then combined to produce the output.
The self-attention operation can be represented by the following equations:
Q = X W_Q, K = X W_K, V = X W_V
Attention(Q, K, V) = softmax(Q Kᵀ / √d_k) V
Output = Attention(Q, K, V) W_O
Here, X is the input sequence; W_Q, W_K, and W_V are learned projection matrices for the query, key, and value vectors, respectively; d_k is the dimensionality of the key vectors; and W_O is a learned projection matrix that maps the output of the attention mechanism to the final output.
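The scaled dot-product self-attention above can be sketched in a few lines of NumPy; random matrices stand in for the learned projections, and the output projection W_O is omitted for brevity:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    d_k = K.shape[-1]
    # One row of attention weights per sequence position; each row sums to 1
    weights = softmax(Q @ K.T / np.sqrt(d_k))
    return weights @ V, weights

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 4, 8, 8
X = rng.normal(size=(seq_len, d_model))     # input sequence
W_q = rng.normal(size=(d_model, d_k))       # stand-in learned projections
W_k = rng.normal(size=(d_model, d_k))
W_v = rng.normal(size=(d_model, d_k))
out, attn = self_attention(X, W_q, W_k, W_v)
```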
The advantages of transformers include their ability to handle variable-length input sequences, their ability to capture long-range dependencies in the data, and their state-of-the-art performance on many NLP tasks.
The disadvantages of transformers include their high computational and memory requirements, their sensitivity to hyperparameter tuning, and their difficulty in handling tasks that require explicit modeling of sequential dynamics.
These are just a few of the most popular machine learning models. The choice of model depends on the problem at hand, the size and quality of the data, and the desired outcome. Now that we have explored the most common machine learning models, we will explain model underfitting and overfitting, which happens during the training process.
In machine learning, the ultimate goal is to build a model that can generalize well on unseen data. However, sometimes, a model can fail to achieve this goal due to either underfitting or overfitting.
Underfitting occurs when a model is too simple to capture the underlying patterns in the data. In other words, the model can’t learn the relationship between the features and the target variable properly. This can result in poor performance on both the training and testing data. For example, in Figure 3.4, we can see that the model is underfitted, and it cannot present the data very well. This is not what we like in machine learning models, and we usually like to see a precise model, as shown in Figure 3.5:
Figure 3.4 – The machine learning model underfitting on the training data
Underfitting happens when the model is not trained well, or the model complexity is not enough to capture the underlying pattern in the data. To solve this problem, we can use a more complex model and continue the training process:
Figure 3.5 – Optimal fitting of the machine learning model on the training data
Optimal fitting happens when the model captures the pattern in the data pretty well but does not overfit every single sample. This helps the model work better on unseen data:
Figure 3.6 – Overfitting the model on the training data
On the other hand, overfitting occurs when a model is too complex and fits the training data too closely, which can lead to poor generalization on new, unseen data, as shown in Figure 3.6. This happens when the model learns the noise or random fluctuations in the training data, rather than the underlying patterns. In other words, the model becomes too specialized for the training data and does not perform well on the testing data. As shown in the preceding figure, the model is overfitted and tries to predict every single sample very precisely. The problem with such a model is that, instead of learning the general pattern, it learns the pattern of each individual sample, which makes it perform poorly when facing new, unseen records.
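Underfitting and overfitting can be reproduced with polynomial models of increasing degree. This sketch (illustrative degrees and noise level) measures only the training error, which keeps shrinking as complexity grows, even though the degree-15 model would generalize poorly to unseen points:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(1)
X = np.sort(rng.uniform(0, 1, 30)).reshape(-1, 1)
y = np.sin(2 * np.pi * X).ravel() + rng.normal(0, 0.1, 30)  # noisy sine wave

def train_mse(degree):
    # Polynomial regression of the given degree, scored on the training set
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X, y)
    return np.mean((model.predict(X) - y) ** 2)

underfit_mse = train_mse(1)    # a straight line cannot capture a sine wave
optimal_mse = train_mse(5)     # flexible enough for the underlying pattern
overfit_mse = train_mse(15)    # starts chasing the noise in individual samples
```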
A useful way to understand the trade-off between underfitting and overfitting is through the bias-variance trade-off. Bias refers to the difference between the predicted values of the model and the actual values in the training data. A high bias means that the model is not complex enough to capture the underlying patterns in the data and underfits the data (Figure 3.7). An underfit model has poor performance on both the training and testing data:
Figure 3.7 – High bias
Variance, on the other hand, refers to the sensitivity of the model to small fluctuations in the training data. A high variance means that the model is overly complex and overfits the data, which leads to poor generalization performance on new data. An overfit model has good performance on the training data but poor performance on the testing data:
Figure 3.8 – Just right (not high bias, not high variance)
To strike a balance between bias and variance, we need to choose a model that is neither too simple nor too complex. As mentioned previously, this is often referred to as the bias-variance trade-off (Figure 3.8). A model with a high bias and low variance can be improved by increasing the complexity of the model, while a model with a high variance and low bias can be improved by decreasing the complexity of the model:
Figure 3.9 – High variance
There are several methods to reduce bias and variance in a model. One common approach is regularization, which adds a penalty term to the loss function to control the complexity of the model. Another approach is to use ensembles, which combine multiple models to improve the overall performance by reducing the variance. Cross-validation can also be used to evaluate the model’s performance and tune its hyperparameters to find the optimal balance between bias and variance.
Overall, understanding bias and variance is crucial in machine learning as it helps us to choose an appropriate model and identify the sources of error in the model.
Bias refers to the error that is introduced by approximating a real-world problem with a simplified model. Variance, on the other hand, refers to the error that is introduced by the model’s sensitivity to small fluctuations in the training data.
When a model has high bias and low variance, it is underfitting. This means that the model is not capturing the complexity of the problem and is making overly simplistic assumptions. When a model has low bias and high variance, it is overfitting. This means that the model is too sensitive to the training data and is fitting the noise instead of the underlying patterns.
To overcome underfitting, we can try increasing the complexity of the model, adding more features, or using a more sophisticated algorithm. To prevent overfitting, several methods can be used:
Figure 3.10 – Early stopping
By using these techniques, it is possible to prevent overfitting and build models that generalize well to new, unseen data. In practice, it is important to monitor both the training and testing performance of the model and make adjustments accordingly to achieve the best possible generalization performance. We will explain how to split our data into training and testing in the next section.
When developing a machine learning model, it’s important to split the data into training, validation, and test sets; this is called data splitting. This is done to evaluate the performance of the model on new, unseen data and to prevent overfitting.
The most common method for splitting the data is the train-test split, which splits the data into two sets: the training set, which is used to train the model, and the test set, which is used to evaluate the performance of the model. The data is randomly divided into two sets, with a typical split being 80% of the data for training and 20% for testing. The model is trained on the majority of the data (the training set) and then tested on the remaining data (the test set), which ensures that the model's performance is measured on new, unseen data.
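In scikit-learn, this is one call; the 80/20 ratio follows the split described above, and `random_state` is an illustrative seed for reproducibility:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(50, 2)   # 50 samples, 2 features
y = np.arange(50)

# test_size=0.2 reserves a random 20% of the rows for the test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
```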
Most of the time in machine learning model development, we have a set of hyperparameters for our model that we would like to tune (we will explain hyperparameter tuning in the next subsection). In this case, we want to make sure that the performance we get on the test set is reliable and not just a chance result of one particular set of hyperparameters. Depending on the size of our training data, we can divide the data into 60%, 20%, and 20% (or 70%, 15%, and 15%) for training, validation, and testing. We train the model on the training data and select the set of hyperparameters that gives us the best performance on the validation set. We then report the actual model performance on the test set, which has not been seen or used before during model training or hyperparameter selection.
A more advanced method for splitting the data, especially when the size of our training data is limited, is k-fold cross-validation. In this method, the data is split into k equally sized “folds,” and the model is trained and tested k times, with each fold being used as the test set once and the remaining folds used as the training set. The results of each fold are then averaged to get an overall measure of the model’s performance. K-fold cross-validation is useful for small datasets where the train-test split may result in a large variance in performance evaluation. In this case, we report the average, minimum, and maximum performance of the model on each of the k folds, as shown in Figure 3.11.
Figure 3.11 – K-fold cross-validation
Another variant of k-fold cross-validation is stratified k-fold cross-validation, which ensures that the distribution of the target variable is consistent across all folds. This is useful when dealing with imbalanced datasets, where the number of instances of one class is much smaller than the others.
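A sketch of stratified 5-fold cross-validation on an imbalanced synthetic dataset; the classifier choice and class weights are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Roughly 80/20 class imbalance
X, y = make_classification(n_samples=200, weights=[0.8, 0.2], random_state=0)

# Stratification preserves the 80/20 class ratio within every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)

# Report the spread across folds, as described in the text
mean_score, min_score, max_score = scores.mean(), scores.min(), scores.max()
```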
Time series data requires special attention when splitting. In this case, we typically use a method called time series cross-validation, which preserves the temporal order of the data. In this method, the data is split into multiple segments, with each segment representing a fixed time interval. The model is then trained on the past data and tested on the future data. This helps to evaluate the performance of the model in real-world scenarios. You can see an example of how to split the data in a time series problem in Figure 3.12:
Figure 3.12 – Time series data splitting
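scikit-learn's `TimeSeriesSplit` implements exactly this expanding-window scheme; the 12-point series below is illustrative:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(12).reshape(12, 1)   # 12 time-ordered observations

tscv = TimeSeriesSplit(n_splits=3)
splits = list(tscv.split(X))

# In every split, all training indices precede all test indices,
# so the model is always trained on the past and tested on the future
temporal_order_kept = all(train.max() < test.min() for train, test in splits)
```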
In all cases, it’s important to ensure that the split is done randomly but with the same random seed each time to ensure the reproducibility of the results. It’s also important to ensure that the split is representative of the underlying data – that is, the distribution of the target variable should be consistent across all sets. Once we have split the data into different subsets for training and testing our model, we can try to find the best set of hyperparameters for our model. This process is called hyperparameter tuning and will be explained next.
Hyperparameter tuning is an important step in the machine learning process that involves selecting the best set of hyperparameters for a given model. Hyperparameters are values that are set before the training process begins and can have a significant impact on the model’s performance. Examples of hyperparameters include learning rate, regularization strength, number of hidden layers in a neural network, and many others.
The process of hyperparameter tuning involves selecting the best combination of hyperparameters that results in the optimal performance of the model. This is typically done by searching through a predefined set of hyperparameters and evaluating their performance on a validation set.
There are several methods for hyperparameter tuning, including grid search, random search, and Bayesian optimization. Grid search involves creating a grid of all possible hyperparameter combinations and evaluating each one on a validation set to determine the optimal set of hyperparameters. Random search, on the other hand, randomly samples hyperparameters from a predefined distribution and evaluates their performance on a validation set.
Random search and grid search explore the search space, exhaustively or randomly, without considering the results of previous hyperparameter trials, which makes them inefficient. Bayesian optimization is an alternative that iteratively computes the posterior distribution of the objective function and takes past evaluations into account to find the best hyperparameters. Using this approach, we can find the best set of hyperparameters in fewer iterations.
Bayesian optimization utilizes past evaluations to probabilistically map hyperparameters to objective function scores, as demonstrated in the following expression:
P(score | hyperparameters)
Here are the steps Bayesian optimization undertakes:
Sequential model-based optimization (SMBO) methods are a formalization of Bayesian optimization, with trials run one after another, trying better hyperparameters each time and updating a probability model (surrogate). SMBO methods differ in steps 3 and 4 – specifically, how they build a surrogate of the objective function and the criteria used to select the next hyperparameters. These variants include Gaussian processes, random forest regressions, and tree-structured Parzen estimators, among others.
In low-dimensional problems with numerical hyperparameters, Bayesian optimization is considered the best available hyperparameter optimization method. However, it is restricted to problems of moderate dimension.
In addition to these methods, there are also several libraries available that automate the process of hyperparameter tuning. Examples of these libraries include scikit-learn’s GridSearchCV and RandomizedSearchCV, Keras Tuner, and Optuna. These libraries allow for efficient hyperparameter tuning and can significantly improve the performance of machine learning models.
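A minimal `GridSearchCV` sketch; the parameter grid and 3-fold setting are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, random_state=0)

# 3 values of C x 2 kernels = 6 combinations, each scored with 3-fold CV
param_grid = {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}
search = GridSearchCV(SVC(), param_grid, cv=3)
search.fit(X, y)
best_params, best_score = search.best_params_, search.best_score_
```

Swapping `GridSearchCV` for `RandomizedSearchCV`, with a distribution per parameter, gives random search through the same interface.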
Hyperparameter optimization in machine learning can be a complex and time-consuming process. Two primary complexity challenges arise in the search process: the trial execution time and the complexity of the search space, including the number of evaluated hyperparameter combinations. In deep learning, these challenges are especially pertinent due to the extensive search space and the utilization of large training sets.
To address these issues and reduce the search space, some standard techniques may be used. For example, reducing the size of the training dataset based on statistical sampling or applying feature selection techniques can help reduce the execution time of each trial. Additionally, identifying the most important hyperparameters for optimization and using additional objective functions beyond just accuracy, such as the number of operations or optimization time, can help reduce the complexity of the search space.
By combining accuracy with visualization through a deconvolution network, researchers have achieved superior results. However, it’s important to note that these techniques are not exhaustive, and the best approach may depend on the specific problem at hand.
Another common approach for improving model performance is to use multiple models in parallel; these are called ensemble models. They are very useful in dealing with machine learning problems.
Ensemble modeling is a technique in machine learning that combines the predictions of multiple models to improve overall performance. The idea behind ensemble models is that multiple models can be better than a single model as different models may capture different patterns in the data.
There are several types of ensemble models, all of which we’ll cover in the following sections.
Bootstrap aggregating, also known as bagging, is an ensemble method that combines multiple independent models trained on different subsets of the training data to reduce variance and improve model generalization.
The bagging algorithm can be summarized as follows:
The bagging algorithm is particularly effective when the base models are unstable (that is, have high variance), such as decision trees, and when the training dataset is small.
The equation for aggregating the predictions of the base models depends on the type of problem (classification or regression). For classification, the ensemble prediction is obtained by taking the majority vote:
ŷ = argmax_c Σ_{b=1}^{B} I(h_b(x) = c)
Here, h_b(x) is the predicted class of the b-th base model for the instance, x, and I() is the indicator function (equal to 1 if its argument is true, and 0 otherwise).
For regression, the ensemble prediction is obtained by taking the average score:
ŷ = (1/B) Σ_{b=1}^{B} h_b(x)
Here, h_b(x) is the predicted value of the b-th base model.
The advantages of bagging are as follows:
The disadvantages of bagging are as follows:
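A bagging sketch with scikit-learn; decision trees are the default base model, and the dataset and number of estimators are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=400, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 50 decision trees, each fit on a bootstrap sample of X_train;
# class predictions are combined by majority vote
bag = BaggingClassifier(n_estimators=50, random_state=0)
bag.fit(X_train, y_train)
accuracy = bag.score(X_test, y_test)
```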
Boosting is another popular ensemble learning technique that aims to improve the performance of weak classifiers by combining them into a stronger classifier. Unlike bagging, boosting focuses on iteratively improving the accuracy of the classifier by adjusting the weights of the training examples. The basic idea behind boosting is to learn from the mistakes of the previous weak classifiers and to put more emphasis on the examples that were incorrectly classified in the previous iteration.
There are several boosting algorithms, but one of the most popular ones is AdaBoost (short for adaptive boosting). The AdaBoost algorithm works as follows:
The final classifier is a weighted combination of the weak classifiers. The importance of each weak classifier is determined by its weighted error rate, which is computed as follows:
err_m = Σ_{i=1}^{N} w_i I(y_i ≠ h_m(x_i)) / Σ_{i=1}^{N} w_i
Here, m is the index of the weak classifier, N is the number of training examples, w_i is the weight of the i-th training example, y_i is the true label of the i-th training example, h_m(x_i) is the prediction of the weak classifier for the i-th training example, and I() is an indicator function that returns 1 if the prediction of the weak classifier is incorrect and 0 otherwise.
The importance of the weak classifier is computed by the following equation:
α_m = (1/2) ln((1 − err_m) / err_m)
The weights of the examples are updated based on their importance:
w_i ← w_i exp(−α_m y_i h_m(x_i)), after which the weights are normalized so that they sum to 1.
The final classifier is then obtained by combining the weak classifiers:
H(x) = sign(Σ_{m=1}^{M} α_m h_m(x))
Here, M is the total number of weak classifiers, h_m(x) is the prediction of the m-th weak classifier, and sign() is a function that returns +1 if its argument is positive and -1 otherwise.
Let’s look at some of the advantages of boosting:
Here are some of the disadvantages of boosting:
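An AdaBoost sketch with scikit-learn; by default, each weak classifier is a depth-1 decision tree (a "stump"), and the dataset and number of rounds are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=400, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Each round reweights the training examples toward the mistakes of the
# previous round and fits a new weak classifier
ada = AdaBoostClassifier(n_estimators=50, random_state=0)
ada.fit(X_train, y_train)
accuracy = ada.score(X_test, y_test)
```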
Stacking is another popular ensemble learning technique that combines the predictions of multiple base models by training a higher-level model on their predictions. The idea behind stacking is to leverage the strengths of different base models to achieve better predictive performance.
Here’s how stacking works:
The higher-level model is typically a simple model such as a linear regression, logistic regression, or a decision tree. The idea is to use the predictions of the base models as input features for the higher-level model. This way, the higher-level model learns to combine the predictions of the base models to make more accurate predictions.
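A stacking sketch with scikit-learn; the base models and the meta-model are illustrative choices:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=400, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

base_models = [
    ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
    ("svm", SVC(probability=True, random_state=0)),
]
# The logistic regression learns how to combine the base models' predictions
stack = StackingClassifier(estimators=base_models,
                           final_estimator=LogisticRegression())
stack.fit(X_train, y_train)
accuracy = stack.score(X_test, y_test)
```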
One of the most commonly known ensemble models is random forest, where the model combines the predictions of multiple decision trees and outputs the aggregated prediction. This is usually more accurate and less prone to overfitting than a single decision tree. We elaborated on random forest earlier in this chapter.
Gradient boosting is another ensemble model that can be used for classification and regression tasks. It works by starting with a weak learner (such as a simple tree) and, at each step, trying to improve on it to build a better model. The main idea is that the model focuses on its mistakes at each step and improves itself by fitting new trees that correct the errors made by the previous trees.
During each iteration, the algorithm computes the negative gradient of the loss function concerning the predicted values, followed by fitting a decision tree to these negative gradient values. The predictions of the new tree are then combined with the predictions of the previous trees, using a learning rate parameter that controls the contribution of each tree to the final prediction.
The overall prediction of the gradient boosting model is obtained by summing up the predictions of all the trees, which are weighted by their respective learning rates.
Let's take a look at the equations for the gradient boosting algorithm.
First, we initialize the model with a constant value:
F_0(x) = argmin_c Σ_{i=1}^{N} L(y_i, c)
Here, c is a constant, y_i is the true label of the i-th sample, N is the number of samples, and L is the loss function, which is used to measure the error between the predicted and true labels.
At each iteration, m, the algorithm fits a decision tree, h_m, to the negative gradient values of the loss function with respect to the predicted values, F_{m−1}(x). The decision tree predicts the negative gradient values, which are then used to update the predictions of the model via the following equation:
F_m(x) = F_{m−1}(x) + η h_m(x)
Here, F_{m−1}(x) is the prediction of the model at the previous iteration, η is the learning rate, and h_m(x) is the prediction of the decision tree at the current iteration.
The final prediction of the model is obtained by combining the predictions of all the trees:
F_M(x) = F_0(x) + Σ_{m=1}^{M} η_m h_m(x)
Here, M is the total number of trees in the model, and η_m and h_m(x) are the learning rate and prediction of the m-th tree, respectively.
Let’s look at some of the advantages of gradient boosting:
Now, let’s look at some of the disadvantages:
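A gradient boosting sketch for a regression task; the hyperparameters are illustrative, and `learning_rate` corresponds to the η shrinkage factor above:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=400, n_features=10, noise=5.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Each of the 200 shallow trees is fit to the negative gradient of the loss
# (for squared loss, the residuals), scaled by learning_rate
gbr = GradientBoostingRegressor(n_estimators=200, learning_rate=0.1,
                                max_depth=3, random_state=0)
gbr.fit(X_train, y_train)
r2 = gbr.score(X_test, y_test)
```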
With that, we have reviewed the ensemble models that can help us improve our model performance. However, sometimes, our dataset has some features that we need to consider before we apply machine learning models. One common case is when we have an imbalanced dataset.
在大多数现实世界的问题中,我们的数据不平衡,这意味着不同类别(例如患有癌症和未患有癌症的患者)的记录分布不同。处理不平衡的数据集是机器学习中的一项重要任务,因为数据集的类别分布不均匀是很常见的。在这种情况下,少数群体的代表性往往不足,这可能会导致模型性能不佳和预测出现偏差。这背后的原因是机器学习方法试图优化其适应度函数以最小化训练集中的误差。现在,假设我们有 99% 的数据来自正类,1% 的数据来自负类。在这种情况下,如果模型将所有记录预测为正,则误差将为 1%;然而,这个模型对我们来说没有用。这就是为什么,如果我们有一个不平衡的数据集,我们需要使用各种方法来处理不平衡的数据。一般来说,我们可以采用三类方法来处理不平衡数据集:
In most real-world problems, our data is imbalanced, which means that the distribution of records from different classes (such as patients with and without cancer) is different. Handling imbalanced datasets is an important task in machine learning as it is common to have datasets with uneven class distribution. In such cases, the minority class is often under-represented, which can cause poor model performance and biased predictions. The reason behind this is that machine learning methods are trying to optimize their fitness function to minimize the error in the training set. Now, let’s say that we have 99% of the data from the positive class and 1% from the negative class. In this case, if the model predicts all records as positive, the error will be 1%; however, this model is not useful for us. That’s why, if we have an imbalanced dataset, we need to use various methods to handle imbalanced data. In general, we can have three categories of methods to handle imbalanced datasets:
SMOTE 是一种广泛使用的处理机器学习中不平衡数据集的算法。它是一种合成数据生成技术,通过在现有样本之间进行插值来创建少数类别中的新合成样本。SMOTE 的工作原理是识别少数类样本的 k 个最近邻,然后沿着连接这些邻居的线段生成新样本。
SMOTE is a widely used algorithm for handling imbalanced datasets in machine learning. It is a synthetic data generation technique that creates new, synthetic samples in the minority class by interpolating between existing samples. SMOTE works by identifying the k-nearest neighbors of a minority class sample and then generating new samples along the line segments that connect these neighbors.
SMOTE算法的步骤如下:
Here are the steps of the SMOTE algorithm:
这将创建一个位于x和x'之间的新样本,但与任一样本都不相同。
This creates a new sample that is somewhere between x and x’, but not the same as either one.
4. 重复步骤 1 至 3,直至生成所需数量的合成样本。
4. Repeat steps 1 to 3 until the desired number of synthetic samples has been generated.
Here are the advantages and disadvantages of SMOTE:
以下是 SMOTE 的实际应用示例。假设我们有一个包含两个类的数据集:多数类(类 0)有 900 个样本,少数类(类 1)有 100 个样本。我们想要使用 SMOTE 为少数类别生成合成样本:
Here is an example of SMOTE in action. Suppose we have a dataset with two classes: the majority class (class 0) has 900 samples, and the minority class (class 1) has 100 samples. We want to use SMOTE to generate synthetic samples for the minority class:
例如,假设 x 为 (1, 2),x' 为 (3, 4),r 为 0.5。在本例中,新样本如下:
For example, suppose x is (1, 2), x’ is (3, 4), and r is 0.5. In this case, the new sample is as follows:
x_new = x + r · (x' - x) = (1, 2) + 0.5 · (2, 2) = (2, 3)
4. 我们重复步骤 1 至 3,直到生成所需数量的合成样本。例如,假设我们想要生成 100 个合成样本。我们对 100 个少数类样本中的每一个重复步骤 1 到 3,然后将原始的 100 个少数类样本与 100 个合成样本结合起来,得到 200 个少数类样本,从而缓解类别不平衡。
4. We repeat steps 1 to 3 until we have generated the desired number of synthetic samples. For example, suppose we want to generate 100 synthetic samples. We repeat steps 1 to 3 for each of the 100 minority class samples and then combine the original 100 minority class samples with the 100 synthetic samples, giving 200 minority class samples and reducing the class imbalance.
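The interpolation steps above can be sketched directly with NumPy and scikit-learn's nearest-neighbor search; in practice, the `SMOTE` class from the imbalanced-learn library does this for you. The data and function below are illustrative assumptions:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_samples(X_min, n_new, k=5, seed=0):
    """Generate n_new synthetic minority samples by interpolating between
    a random minority sample and one of its k nearest neighbors."""
    rng = np.random.RandomState(seed)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, idx = nn.kneighbors(X_min)          # idx[:, 0] is the point itself
    synthetic = []
    for _ in range(n_new):
        i = rng.randint(len(X_min))        # pick a minority sample x
        j = idx[i, rng.randint(1, k + 1)]  # pick one of its k neighbors x'
        r = rng.rand()                     # random factor in [0, 1)
        synthetic.append(X_min[i] + r * (X_min[j] - X_min[i]))
    return np.array(synthetic)

X_min = np.random.RandomState(1).normal(size=(100, 2))  # 100 minority samples
X_new = smote_samples(X_min, n_new=100)
print(X_new.shape)  # (100, 2)
```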
NearMiss 算法是一种通过对多数类中的记录进行欠采样(删除)来平衡类分布的技术。当两个类中的记录彼此非常接近时,从多数类中删除一些记录会增加两个类之间的距离,这有助于分类过程。由于大多数欠采样方法存在信息丢失问题,NearMiss 方法被广泛用于缓解这一问题。
The NearMiss algorithm is a technique for balancing the class distribution by undersampling (removing) records from the majority class. When the two classes have records that are very close to each other, eliminating some of the records from the majority class increases the distance between the two classes, which helps the classification process. Because most undersampling methods risk losing information, NearMiss methods are widely used to mitigate this problem.
最近邻方法的工作基于以下步骤:
The working of nearest-neighbor methods is based on the following steps:
我们可以使用 NearMiss 算法的三种变体来查找多数类中 n 个最接近的记录:
There are three variations of the NearMiss algorithm that we can use to find the n closest records in the majority class:
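The first of these variations, NearMiss-1, keeps the majority samples whose average distance to their closest minority samples is smallest. Here is a minimal NumPy sketch of that idea (the data and parameters are illustrative; the imbalanced-learn library's `NearMiss` class provides a production implementation of all three variations):

```python
import numpy as np

def near_miss_1(X_maj, X_min, n_keep, n_neighbors=3):
    """NearMiss-1 sketch: keep the n_keep majority samples whose average
    distance to their n_neighbors closest minority samples is smallest."""
    # Pairwise distances between every majority and every minority sample
    d = np.linalg.norm(X_maj[:, None, :] - X_min[None, :, :], axis=2)
    avg = np.sort(d, axis=1)[:, :n_neighbors].mean(axis=1)
    keep = np.argsort(avg)[:n_keep]
    return X_maj[keep]

rng = np.random.RandomState(0)
X_maj = rng.normal(0, 1, size=(90, 2))   # majority class
X_min = rng.normal(2, 1, size=(10, 2))   # minority class
X_under = near_miss_1(X_maj, X_min, n_keep=10)
print(X_under.shape)  # (10, 2)
```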
成本敏感学习是一种用于在不平衡数据集上训练机器学习模型的方法。在不平衡数据集中,一个类(通常是少数类)中的示例数量远低于另一类(通常是多数类)。成本敏感学习涉及为模型分配随预测类别而不同的错误分类成本,这可以帮助模型更加专注于正确分类少数类别。
Cost-sensitive learning is a method that’s used to train machine learning models on imbalanced datasets. In imbalanced datasets, the number of examples in one class (usually the minority class) is much lower than in the other class (usually the majority class). Cost-sensitive learning involves assigning misclassification costs to the model that differ based on the class being predicted, which can help the model focus more on correctly classifying the minority class.
假设我们有一个二元分类问题,有两个类别:正类和负类。在成本敏感学习中,我们为不同类型的错误分配不同的成本。例如,我们可能会为将正例错误分类为负例分配更高的成本,因为在不平衡的数据集中,正类是少数类,而错误分类正例会对模型的性能产生更大的影响。
Let’s assume we have a binary classification problem with two classes, positive and negative. In cost-sensitive learning, we assign different costs to different types of errors. For example, we may assign a higher cost to misclassifying a positive example as negative because in an imbalanced dataset, the positive class is the minority class, and misclassifying positive examples can have a greater impact on the performance of the model.
我们可以以混淆矩阵的形式分配成本:
We can assign costs in the form of a confusion matrix:
|  | 预测为阳性 Predicted Positive | 预测为阴性 Predicted Negative |
| 实际为阳性 Actual Positive | TP_cost | FN_cost |
| 实际为阴性 Actual Negative | FP_cost | TN_cost |
表 3.2 – 混淆矩阵成本
Table 3.2 – Confusion matrix costs
这里,TP_cost、FN_cost、FP_cost和TN_cost分别是与真阳性、假阴性、假阳性和真阴性相关的成本。
Here, TP_cost, FN_cost, FP_cost, and TN_cost are the costs associated with true positives, false negatives, false positives, and true negatives, respectively.
为了将成本矩阵纳入训练过程,我们可以修改模型在训练期间优化的标准损失函数。一种常见的成本敏感损失函数是加权交叉熵损失,其定义如下:
To incorporate the cost matrix into the training process, we can modify the standard loss function that the model optimizes during training. One common cost-sensitive loss function is the weighted cross-entropy loss, which is defined as follows:
L = -[w_pos · y · log(p) + w_neg · (1 - y) · log(1 - p)]
这里,y 是真实标签(0 或 1),p 是正类的预测概率,w_pos 和 w_neg 分别是分配给正类和负类的权重。
Here, y is the true label (either 0 or 1), p is the predicted probability of the positive class, and w_pos and w_neg are the weights assigned to the positive and negative classes, respectively.
权重 w_pos 和 w_neg 可以根据混淆矩阵中分配的成本来确定。例如,如果我们为假阴性(即将正例错误分类为负例)分配更高的成本,我们可以将 w_pos 设置为比 w_neg 更高的值。
The weights, w_pos and w_neg, can be determined from the costs assigned in the confusion matrix. For example, if we assign a higher cost to false negatives (that is, misclassifying a positive example as negative), we may set w_pos to a higher value than w_neg.
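The weighted cross-entropy loss can be sketched in a few lines of NumPy; the labels, probabilities, and weights below are illustrative assumptions:

```python
import numpy as np

def weighted_cross_entropy(y, p, w_pos, w_neg, eps=1e-12):
    """Weighted cross-entropy loss: errors on the positive class are
    scaled by w_pos, errors on the negative class by w_neg."""
    p = np.clip(p, eps, 1 - eps)
    return -np.mean(w_pos * y * np.log(p) + w_neg * (1 - y) * np.log(1 - p))

y = np.array([1.0, 1.0, 0.0, 0.0])
p = np.array([0.2, 0.9, 0.1, 0.8])   # one confident mistake per class
balanced = weighted_cross_entropy(y, p, w_pos=1.0, w_neg=1.0)
fn_heavy = weighted_cross_entropy(y, p, w_pos=5.0, w_neg=1.0)
print(fn_heavy > balanced)  # True: missed positives now cost more
```

In scikit-learn, the same idea is exposed through the `class_weight` parameter of many estimators, for example `LogisticRegression(class_weight="balanced")`.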
成本敏感学习还可以与其他类型的模型一起使用,例如决策树和支持向量机。将成本分配给不同类型的错误的概念可以以多种方式应用,以提高模型在不平衡数据集上的性能。然而,根据数据集的具体特征和要解决的问题仔细选择合适的成本矩阵和损失函数非常重要:
Cost-sensitive learning can also be used with other types of models, such as decision trees and SVMs. The concept of assigning costs to different types of errors can be applied in various ways to improve the performance of a model on imbalanced datasets. However, it’s important to carefully select the appropriate cost matrix and loss function based on the specific characteristics of the dataset and the problem being solved:
数据增强背后的想法是通过对原始示例应用变换来生成新示例,同时仍然保留标签。这些变换可以包括旋转、平移、缩放、翻转和添加噪声等。这对于不平衡的数据集特别有用,其中一个类中的示例数量比另一类中的示例数量少得多。
The idea behind data augmentation is to generate new examples by applying transformations to the original ones, while still retaining the label. These transformations can include rotation, translation, scaling, flipping, and adding noise, among others. This can be particularly useful for imbalanced datasets, where the number of examples in one class is much smaller than in the other.
在不平衡数据集的情况下,数据增强可用于创建少数类的新示例,从而有效地平衡数据集。这可以通过将相同的一组转换应用于少数类示例来完成,创建一组仍然代表少数类但与原始示例略有不同的新示例。
In the context of imbalanced datasets, data augmentation can be used to create new examples of the minority class, effectively balancing the dataset. This can be done by applying the same set of transformations to the minority class examples, creating a new set of examples that are still representative of the minority class but are slightly different from the original ones.
数据增强中涉及的方程相对简单,因为它们基于将变换函数应用于原始示例。例如,要将图像旋转角度 θ,我们可以使用旋转矩阵:
The equations involved in data augmentation are relatively simple as they are based on applying transformation functions to the original examples. For example, to rotate an image by a certain angle, θ, we can use a rotation matrix:
x' = x · cos(θ) - y · sin(θ)
y' = x · sin(θ) + y · cos(θ)
这里,x 和 y 是图像中某个像素的原始坐标,x' 和 y' 是旋转后的新坐标,θ 是旋转角度。
Here, x and y are the original coordinates of a pixel in the image, x' and y' are the new coordinates after rotation, and θ is the angle of rotation.
同样,要应用平移,我们可以简单地将图像移动一定数量的像素:
Similarly, to apply translation, we can simply shift the image by a certain number of pixels:
x' = x + dx
y' = y + dy
这里,dx 和 dy 分别是水平和垂直位移。
Here, dx and dy are the horizontal and vertical shifts, respectively.
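Both transformations can be sketched directly on coordinate arrays with NumPy (image libraries such as Pillow or OpenCV apply them to whole images in practice); the sample point below is an illustrative assumption:

```python
import numpy as np

def rotate(points, theta):
    """Rotate (x, y) points about the origin by angle theta (radians)."""
    R = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])
    return points @ R.T

def translate(points, dx, dy):
    """Shift points by (dx, dy)."""
    return points + np.array([dx, dy])

p = np.array([[1.0, 0.0]])
print(np.round(rotate(p, np.pi / 2), 6))  # [[0. 1.]]
print(translate(p, 2, 3))                 # [[3. 3.]]
```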
数据增强可以是解决不平衡数据集的强大技术,因为它可以创建代表少数类别的新示例,同时仍然保存标签信息。然而,在应用数据增强时一定要小心,因为它还会在数据中引入噪声和伪影,如果处理不当,可能会导致过度拟合。
Data augmentation can be a powerful technique for addressing imbalanced datasets as it can create new examples that are representative of the minority class, while still preserving the label information. However, it is important to be careful when applying data augmentation as it can also introduce noise and artifacts in the data, and can lead to overfitting if not done properly.
总之,处理不平衡数据集是机器学习的一个重要方面。有多种技术可用于处理不平衡数据集,每种技术都有其优点和缺点。技术的选择取决于数据集、问题和可用资源。除了不平衡数据之外,在处理时间序列数据时,我们可能会面临相关的数据。接下来我们将仔细研究这一点。
In conclusion, handling imbalanced datasets is an important aspect of machine learning. There are several techniques available to handle imbalanced datasets, each with its advantages and disadvantages. The choice of technique depends on the dataset, the problem, and the available resources. Besides having imbalanced data, in the case of working on time series data, we might face correlated data. We’ll take a closer look at this next.
处理机器学习模型中的相关时间序列数据可能具有挑战性,因为随机抽样等传统技术可能会引入偏差并忽略数据点之间的依赖性。以下是一些可以提供帮助的方法:
Dealing with correlated time series data in machine learning models can be challenging as traditional techniques such as random sampling can introduce biases and overlook dependencies between data points. Here are some approaches that can help:
一般来说,使用能够保留数据中时间依赖性和模式的技术来处理时间序列数据非常重要。这可能需要专门的建模技术和预处理步骤。
In general, it’s important to approach time series data with techniques that preserve the temporal dependencies and patterns in the data. This can require specialized modeling techniques and preprocessing steps.
在本章中,我们从数据探索和预处理技术开始,了解了与机器学习相关的各种概念。然后,我们探索了各种机器学习模型,例如逻辑回归、决策树、支持向量机和随机森林,及其优点和缺点。我们还讨论了将数据拆分为训练集和测试集的重要性,以及处理不平衡数据集的技术。
In this chapter, we learned about various concepts related to machine learning, starting with data exploration and preprocessing techniques. We then explored various machine learning models, such as logistic regression, decision trees, support vector machines, and random forests, along with their strengths and weaknesses. We also discussed the importance of splitting data into training and test sets, as well as techniques for handling imbalanced datasets.
本章还介绍了模型偏差、方差、欠拟合和过拟合的概念,以及如何诊断和解决这些问题。我们还探索了 bagging、boosting 和 stacking 等集成方法,这些方法可以通过组合多个模型的预测来提高模型性能。
The chapter also covered the concepts of model bias, variance, underfitting, and overfitting, and how to diagnose and address these issues. We also explored ensemble methods such as bagging, boosting, and stacking, which can improve model performance by combining the predictions of multiple models.
最后,我们了解了机器学习的局限性和挑战,包括需要大量高质量数据、存在偏见和不公平的风险以及解释复杂模型的难度。尽管存在这些挑战,机器学习仍然提供了解决各种问题的强大工具,并有潜力改变许多行业和领域。
Finally, we learned about the limitations and challenges of machine learning, including the need for large amounts of high-quality data, the risk of bias and unfairness, and the difficulty of interpreting complex models. Despite these challenges, machine learning offers powerful tools for solving a wide range of problems and has the potential to transform many industries and fields.
在下一章中,我们将讨论文本预处理,这是机器学习模型使用文本所必需的。
In the next chapter, we will discuss text preprocessing, which is required for text to be used by machine learning models.
文本预处理是自然语言处理(NLP)领域至关重要的第一步。它包括将原始的、未经加工的文本数据转换为机器学习算法可以轻松理解的格式。为了从文本数据中提取有意义的见解,必须对数据进行清理、规范化并将其转换为更结构化的形式。本章概述了最常用的文本预处理技术,包括标记化、词干提取、词形还原、停用词删除和词性(POS)标记,以及它们的优点和局限性。
Text preprocessing stands as a vital initial step in the realm of natural language processing (NLP). It encompasses converting raw, unrefined text data into a format that machine learning algorithms can readily comprehend. To extract meaningful insights from textual data, it is essential to clean, normalize, and transform the data into a more structured form. This chapter provides an overview of the most commonly used text preprocessing techniques, including tokenization, stemming, lemmatization, stop word removal, and part-of-speech (POS) tagging, along with their advantages and limitations.
有效的文本预处理对于各种 NLP 任务至关重要,包括情感分析、语言翻译和信息检索。通过应用这些技术,原始文本数据可以转换为结构化和标准化的格式,可以使用统计和机器学习方法轻松分析。然而,选择适当的预处理技术可能具有挑战性,因为最佳方法取决于手头的特定任务和数据集。因此,仔细评估和比较不同的文本预处理技术以确定针对给定应用程序的最有效方法非常重要。
Effective text preprocessing is essential for various NLP tasks, including sentiment analysis, language translation, and information retrieval. By applying these techniques, raw text data can be transformed into a structured and normalized format that can be easily analyzed using statistical and machine learning methods. However, selecting the appropriate preprocessing techniques can be challenging since the optimal methods depend on the specific task and dataset at hand. Therefore, it is important to carefully evaluate and compare different text preprocessing techniques to determine the most effective approach for a given application.
本章将涵盖以下主题:
The following topics will be covered in this chapter:
要完成本章中有关文本预处理的示例和练习,您需要具备 Python 等编程语言的应用知识,并且熟悉 NLP 概念。您还需要安装某些库,例如Natural Language Toolkit ( NLTK )、spaCy和scikit-learn。这些库为文本预处理和特征提取提供了强大的工具。建议您访问Jupyter Notebook环境或其他交互式编码环境,以方便实验和探索。此外,使用示例数据集可以帮助您了解各种技术及其对文本数据的影响。
To follow along with the examples and exercises in this chapter on text preprocessing, you will need a working knowledge of a programming language such as Python, as well as some familiarity with NLP concepts. You will also need to have certain libraries installed, such as Natural Language Toolkit (NLTK), spaCy, and scikit-learn. These libraries provide powerful tools for text preprocessing and feature extraction. It is recommended that you have access to a Jupyter Notebook environment or another interactive coding environment to facilitate experimentation and exploration. Additionally, having a sample dataset to work with can help you understand the various techniques and their effects on text data.
文本规范化是将文本转换为标准形式以确保一致性并减少变化的过程。使用不同的技术来规范化文本,包括小写、删除特殊字符、拼写检查以及词干或词形还原。我们将通过代码示例详细解释这些步骤以及如何使用它们。
Text normalization is the process of transforming text into a standard form to ensure consistency and reduce variations. Different techniques are used for normalizing text, including lowercasing, removing special characters, spell checking, and stemming or lemmatization. We will explain these steps in detail, and how to use them, with code examples.
小写化是 NLP 中用于标准化文本并降低词汇复杂性的一种常见文本预处理技术。在此技术中,所有文本都被转换为小写字符。
Lowercasing is a common text preprocessing technique that’s used in NLP to standardize text and reduce the complexity of vocabulary. In this technique, all the text is converted into lowercase characters.
小写的主要目的是使文本统一并避免因大写而可能出现的任何差异。通过将所有文本转换为小写,机器学习算法可以将大写和非大写的相同单词视为相同,从而减少总体词汇量并使文本更易于处理。
The main purpose of lowercasing is to make the text uniform and avoid any discrepancies that may arise from capitalization. By converting all the text into lowercase, the machine learning algorithms can treat the same words that are capitalized and non-capitalized as the same, reducing the overall vocabulary size and making the text easier to process.
小写对于文本分类、情感分析和语言建模等任务特别有用,这些任务中文本的含义不受单词大小写的影响。然而,它可能不适合某些任务,例如 NER,其中大写可能是一个重要特征。
Lowercasing is particularly useful for tasks such as text classification, sentiment analysis, and language modeling, where the meaning of the text is not affected by the capitalization of the words. However, it may not be suitable for certain tasks, such as NER, where capitalization can be an important feature.
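In Python, lowercasing is a one-liner on any string (the sample text is illustrative):

```python
text = "The Quick Brown FOX"
print(text.lower())  # the quick brown fox
```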
删除特殊字符和标点符号是文本预处理的一个重要步骤。特殊字符和标点符号不会给文本添加太多含义,如果不删除它们可能会给机器学习模型带来问题。执行此任务的一种方法是使用正则表达式,如下所示:
Removing special characters and punctuation is an important step in text preprocessing. Special characters and punctuation marks do not add much meaning to the text and can cause issues for machine learning models if they are not removed. One way to perform this task is by using regular expressions, such as the following:
re.sub(r"[^a-zA-Z0-9]+", "", string)
这将从我们的输入字符串中删除所有非字母数字字符。有时,我们可能想用空格替换某些特殊字符。看看下面的例子:
This will remove all non-alphanumeric characters from our input string. Sometimes, there may be special characters that we would want to replace with whitespace. Take a look at the following examples:
在这两个示例中,我们希望将“-”替换为空格,如下所示:
In these two examples, we would want to replace the “-” with whitespace, as follows:
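A sketch of this replacement, using the same regular expression with a space as the replacement string (the example string is illustrative):

```python
import re

s = "state-of-the-art text-preprocessing"
# Replace each run of non-alphanumeric characters with a single space
cleaned = re.sub(r"[^a-zA-Z0-9]+", " ", s).strip()
print(cleaned)  # state of the art text preprocessing
```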
接下来,我们将介绍停用词删除。
Next, we’ll cover stop word removal.
停用词是对句子或一段文本的含义没有多大贡献的单词,因此可以安全地删除而不会丢失太多信息。停用词的示例包括“a”、“an”、“the”、“and”、“in”、“at”、“on”、“to”、“for”、“is”、“are”等。
Stop words are words that do not contribute much to the meaning of a sentence or piece of text, and therefore can be safely removed without us losing much information. Examples of stop words include “a,” “an,” “the,” “and,” “in,” “at,” “on,” “to,” “for,” “is,” “are,” and so on.
停用词删除是在进行任何文本分析任务(例如情感分析、主题建模或信息检索)之前执行的常见文本预处理步骤。目标是减少词汇量的大小和特征空间的维数,从而提高后续分析步骤的效率和效果。
Stop word removal is a common text preprocessing step that is performed before any text analysis tasks, such as sentiment analysis, topic modeling, or information retrieval. The goal is to reduce the size of the vocabulary and the dimensionality of the feature space, which can improve the efficiency and effectiveness of subsequent analysis steps.
停用词删除的过程包括识别停用词列表(通常是预定义的或从语料库学习的),将输入文本标记为单词或标记,然后删除与停用词列表匹配的任何单词。生成的文本仅包含承载文本含义的重要单词。
The process of stop word removal involves identifying a list of stop words (usually predefined or learned from a corpus), tokenizing the input text into words or tokens, and then removing any words that match the stop word list. The resulting text consists of only the important words that carry the meaning of the text.
可以使用各种编程语言、工具和库来执行停用词删除。例如,NLTK是一个流行的NLP Python库,它提供了各种语言的停用词列表,以及从文本中删除停用词的方法。
Stop word removal can be performed using various programming languages, tools, and libraries. For example, NLTK, which is a popular Python library for NLP, provides a list of stop words for various languages, as well as a method for removing stop words from text.
以下是删除停用词的示例:
Here’s an example of stop word removal:
这是一个演示停用词过滤的例句。
This is a sample sentence demonstrating stop word filtration.
执行停用词删除后,我们得到以下输出:
After performing stop word removal, we get the following output:
演示停用词过滤的例句
Sample sentence demonstrating stop word filtration
本章包含专门用于此目的的 Python 代码。您可以参考本章中描述的每个操作。
This chapter contains Python code dedicated to this. You can refer to it for each of the actions that are described in this chapter.
我们可以看到,停用词“This”、“is”和“a”已从原句子中删除,只留下重要的词。
As we can see, the stop words “This,” “is,” and “a,” have been removed from the original sentence, leaving only the important words.
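The filtering above can be sketched with a hand-picked stop word list; in practice, NLTK's `stopwords.words("english")` provides a comprehensive list for this purpose:

```python
# Hand-picked stop word list for illustration only
stop_words = {"this", "is", "a", "the", "and", "to", "for", "of"}
sentence = "This is a sample sentence demonstrating stop word filtration"
filtered = [t for t in sentence.split() if t.lower() not in stop_words]
print(" ".join(filtered))  # sample sentence demonstrating stop word filtration
```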
拼写检查和更正涉及纠正文本中拼写错误的单词。这很重要,因为拼写错误的单词可能会导致数据不一致并影响算法的准确性。例如,看一下下面的句子:
Spell checking and correction involves correcting misspelled words in the text. This is important because misspelled words can cause inconsistencies in the data and affect the accuracy of algorithms. For example, take a look at the following sentence:
我要去面包店
I am going to the bakkery
这将转换为以下内容:
This would be transformed into the following:
我要去面包店
I am going to the bakery
让我们继续进行词形还原。
Let’s move on to lemmatization.
词形还原是一种旨在将单词简化为其基础或字典形式(称为词元)的文本规范化方法。词形还原的主要目标是聚合同一单词的各种形式,以便将它们作为一个统一的术语进行分析。
Lemmatization is a text normalization approach that aims to simplify a word to its base or dictionary form, referred to as a lemma. The primary objective of lemmatization is to aggregate various forms of the same word, facilitating their analysis as a unified term.
例如,考虑以下句子:
For example, consider the following sentence:
三只猫在田里追老鼠,一只猫看着一只老鼠。
Three cats were chasing the mice in the fields, while one cat watched one mouse.
在这句话的上下文中,“cat”和“cats”是同一个词的两种不同形式,“mouse”和“mice”也是同一个词的两种不同形式。词形还原会将这些单词还原为其基本形式:
In the context of this sentence, “cat” and “cats” are two different forms of the same word, and “mouse” and “mice” are also two different forms of the same word. Lemmatization would reduce these words to their base forms:
猫在田里追老鼠,而一只猫看着一只老鼠。
the cat be chasing the mouse in the field, while one cat watched one mouse.
在这种情况下,“cat”和“cats”都被简化为“cat”的基本形式,“mouse”和“mice”都被简化为“mouse”的基本形式。这可以更好地分析文本,因为“猫”和“老鼠”的出现现在被视为相同的术语,无论它们的屈折变化如何。
In this case, “cat” and “cats” have both been reduced to their base form of “cat,” and “mouse” and “mice” have both been reduced to their base form of “mouse.” This allows for better analysis of the text since the occurrences of “cat” and “mouse” are now treated as the same term, regardless of their inflectional variations.
词形还原与词干提取不同,词干提取涉及将单词简化为公共词干,而该词干本身不一定是单词。例如,“cats”和“cat”的词干都是“cat”。“cats”和“cat”的引理也将是“cat”。
Lemmatization is different from stemming, which involves reducing a word to a common stem that may not necessarily be a word in its own right. For example, the stem of “cats” and “cat” would both be “cat.” The lemma of “cats” and “cat” would be “cat” as well.
可以使用各种 NLP 库和工具(例如 NLTK、spaCy 和 Stanford CoreNLP)进行词形还原。
Lemmatization can be performed using various NLP libraries and tools, such as NLTK, spaCy, and Stanford CoreNLP.
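The idea can be sketched with a toy lemma dictionary; production systems look lemmas up via NLTK's `WordNetLemmatizer` or spaCy's `token.lemma_` rather than a hand-written mapping like this one:

```python
# Toy lemma dictionary (illustrative assumption, not a real lexicon)
lemmas = {"cats": "cat", "mice": "mouse", "were": "be", "fields": "field"}
sentence = "Three cats were chasing the mice in the fields"
result = " ".join(lemmas.get(w, w) for w in sentence.lower().split())
print(result)  # three cat be chasing the mouse in the field
```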
词干提取涉及将单词还原为其基本或词根形式,称为“词干”。此过程通常在 NLP 中用于准备用于分析、检索或存储的文本。词干算法的工作原理是切掉单词的结尾或后缀,只留下词干。
Stemming involves reducing words to their fundamental or root form, referred to as the “stem.” This process is commonly used in NLP to prepare text for analysis, retrieval, or storage. Stemming algorithms work by cutting off the ends or suffixes of words, leaving only the stem.
词干提取的目标是将单词的所有变形或派生形式转换为通用的基本形式。例如,单词“running”的词干是“run”,单词“runs”的词干也是“run”。
The goal of stemming is to convert all inflected or derived forms of a word into a common base form. For example, the stem of the word “running” is “run,” and the stem of the word “runs” is also “run.”
一种常用的词干算法是 Porter 词干算法。该算法基于一系列识别后缀并将其从单词中删除以获得词干的规则。例如,波特算法会通过删除“ ing”后缀将单词“leaping”转换为“leap”。
One commonly used stemming algorithm is the Porter stemming algorithm. This algorithm is based on a series of rules that identify suffixes and remove them from words to obtain the stem. For example, the Porter algorithm would convert the word “leaping” into “leap” by removing the “ing” suffix.
让我们看一个示例句子来了解词干提取的实际效果:
Let’s look at an example sentence to see stemming in action:
他们在墙上奔跑和跳跃
They are running and leaping across the walls
这是词干提取后的文本(使用波特算法):
Here’s the stemmed text (using the Porter algorithm):
他们跑着、跳着越过墙
They are run and leap across the wall
正如您所看到的,单词“running”和“leaping”已分别转换为其基本形式“run”和“leap”,并且后缀“s”已从“walls”中删除。
As you can see, the words “running” and “leaping” have been converted into their base forms of “run” and “leap,” respectively, and the suffix “s” has been removed from “walls.”
词干提取对于信息检索或情感分析等文本分析任务非常有用,因为它减少了文档或语料库中唯一单词的数量,并有助于对相似的单词进行分组。然而,词干提取也会引入错误,因为它有时会产生不是实际单词的词干,或者产生不是单词预期基本形式的词干。例如,词干分析器可能会生成“walk”作为“walked”和“walking”的词干,尽管“walk”和“walked”具有不同的含义。因此,评估词干提取的结果非常重要,以确保它产生准确且有用的结果。
Stemming can be useful for text analysis tasks such as information retrieval or sentiment analysis as it reduces the number of unique words in a document or corpus and can help to group similar words. However, stemming can also introduce errors as it can sometimes produce stems that are not actual words or produce stems that are not the intended base form of the word. For example, the stemmer might produce “walk” as the stem for both “walked” and “walking,” even though “walk” and “walked” have different meanings. Therefore, it’s important to evaluate the results of stemming to ensure that it is producing accurate and useful results.
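A crude suffix-stripping stemmer shows the mechanics and, usefully, also demonstrates the error mode described above: it yields "runn" for "running", whereas the Porter algorithm (`nltk.stem.PorterStemmer`) applies a much richer rule set and would produce "run":

```python
def crude_stem(word):
    """Strip a few common suffixes; purely illustrative, not Porter."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

words = ["running", "leaping", "walls", "walked"]
print([crude_stem(w) for w in words])  # ['runn', 'leap', 'wall', 'walk']
```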
NER 是一种旨在检测文本中的命名实体并对其进行分类的 NLP 技术,这些实体包括但不限于人名、组织名称、地点等。NER 的主要目标是从非结构化文本数据中自动识别和提取有关这些命名实体的信息。
NER is an NLP technique that's designed to detect and categorize named entities within text, including but not limited to people's names, organization names, locations, and more. NER's primary objective is to automatically identify and extract information about these named entities from unstructured text data.
NER 通常涉及使用机器学习模型,例如条件随机场( CRF ) 或循环神经网络( RNN ),来标记给定句子中的单词及其相应的实体类型。这些模型在包含带标记实体的文本的大型带注释数据集上进行训练。然后,这些模型使用基于上下文的规则来识别新文本中的命名实体。
NER typically involves using machine learning models, such as conditional random fields (CRFs) or recurrent neural networks (RNNs), to tag words in a given sentence with their corresponding entity types. The models are trained on large annotated datasets that contain text with labeled entities. These models then use context-based rules to identify named entities in new text.
There are several categories of named entities that can be identified by NER, including the following:
Here’s an example of how NER works. Take a look at the following sentence:
苹果公司是一家总部位于加利福尼亚州库比蒂诺的科技公司。
Apple Inc. is a technology company headquartered in Cupertino, California.
在这里,NER 会将“Apple Inc.”识别为一个组织,将“加利福尼亚州库比蒂诺”识别为一个位置。NER 系统的输出可以是句子的结构化表示,如下所示:
Here, NER would identify “Apple Inc.” as an organization and “Cupertino, California” as a location. The output of an NER system could be a structured representation of the sentence, as shown here:
{"organization": "Apple Inc.",
 "location": "Cupertino, California"}
NER 在各个领域都有许多应用,包括信息检索、问答、情感分析等。它可用于从非结构化文本数据中自动提取结构化信息,这些信息可以进一步分析或用于下游任务。
NER has many applications in various fields, including information retrieval, question-answering, sentiment analysis, and more. It can be used to automatically extract structured information from unstructured text data, which can be further analyzed or used for downstream tasks.
执行 NER 有不同的方法和工具,但执行 NER 时的一般步骤如下:
There are different approaches and tools to perform NER, but the general steps when performing NER are as follows:
以下是如何执行NER 的示例:
Here’s an example of how NER can be performed:
原文:
Original text:
苹果今年正在谈判收购一家中国初创企业。
Apple is negotiating to buy a Chinese start-up this year.
预处理文本:
Preprocessed text:
苹果洽谈收购中国初创企业年
apple negotiating buy Chinese start-up year
标记文本:
Tagged text:
B-ORG O O B-LOC O O
在此示例中,命名实体“Apple”和“Chinese”分别被标识为组织(B-ORG)和位置(B-LOC)。在本例中,“this year”未被识别为命名实体,但如果使用更复杂的标记方案,或者模型接受了相应数据的训练,则可能会被识别出来。
In this example, the named entities “Apple” and “Chinese” are identified as an organization (B-ORG) and a location (B-LOC), respectively. “this year” is not recognized as a named entity in this example, but it would be if a more complex tagging scheme is used or if the model is trained on data that would promote that.
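The BIO tagging in this example can be sketched with a toy gazetteer (word-list) tagger; real NER relies on statistical models such as spaCy's pipelines or a CRF, not hard-coded word lists like these:

```python
# Toy gazetteers (illustrative assumptions, not real entity lexicons)
orgs, locs = {"apple"}, {"chinese"}
tokens = "apple negotiating buy Chinese start-up year".split()
tags = ["B-ORG" if t.lower() in orgs
        else "B-LOC" if t.lower() in locs
        else "O"
        for t in tokens]
print(" ".join(tags))  # B-ORG O O B-LOC O O
```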
根据编程语言和项目的具体需求,可以使用多个库进行 NER。我们来看看一些常用的库:
Several libraries can be used for NER, depending on the programming language and specific needs of the project. Let’s take a look at some commonly used libraries:
还有许多其他库可用于 NER,库的选择将取决于编程语言、可用模型和项目的具体要求等因素。在下一节中,我们将解释词性标记和执行此任务的不同方法。
There are many other libraries available for NER, and the choice of library will depend on factors such as the programming language, available models, and specific requirements of the project. In the next section, we will explain POS tagging and different methods to perform this task.
词性标注是为句子中的各个单词赋予语法标签(例如名词、动词、形容词等)的实践。此标记过程作为各种 NLP 任务(包括文本分类、情感分析和机器翻译)的基础步骤具有重要意义。
POS tagging is the practice of attributing grammatical labels, such as nouns, verbs, adjectives, and others, to individual words within a sentence. This tagging process holds significance as a foundational step in various NLP tasks, including text classification, sentiment analysis, and machine translation.
词性标注可以使用不同的方法来执行,例如基于规则的方法、统计方法和基于深度学习的方法。在本节中,我们将简要概述每种方法。
POS tagging can be performed using different approaches such as rule-based methods, statistical methods, and deep learning-based methods. In this section, we’ll provide a brief overview of each approach.
基于规则的词性标注方法涉及定义一组规则或模式,可用于自动为文本中的单词标记其相应的词性,例如名词、动词、形容词等。
Rule-based methods for POS tagging involve defining a set of rules or patterns that can be used to automatically tag words in a text with their corresponding parts of speech, such as nouns, verbs, adjectives, and so on.
该过程涉及定义一组规则或模式来识别句子中的不同词性。例如,一条规则可能规定任何以“-ing”结尾的单词都是动名词(充当名词的动词),而另一条规则可能规定任何前面带有冠词(例如“a”或“an”)的单词很可能是名词。
The process involves defining a set of rules or patterns for identifying the different parts of speech in a sentence. For example, a rule may state that any word ending in “-ing” is a gerund (a verb acting as a noun), while another rule may state that any word preceded by an article such as “a” or “an” is likely a noun.
这些规则通常基于语言知识,例如语法和句法知识,并且通常特定于特定语言。它们还可以补充词汇或词典,提供有关单词的含义和用法的附加信息。
These rules are typically based on linguistic knowledge, such as knowledge of grammar and syntax, and are often specific to a particular language. They can also be supplemented with lexicons or dictionaries that provide additional information about the meanings and usage of words.
基于规则的标记过程涉及将这些规则应用于给定文本并识别每个单词的词性。这可以手动完成,但通常使用支持正则表达式和模式匹配的软件工具和编程语言自动完成。
The process of rule-based tagging involves applying these rules to a given text and identifying the parts of speech for each word. This can be done manually but is typically automated using software tools and programming languages that support regular expressions and pattern matching.
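The two rules mentioned above can be sketched as a tiny rule-based tagger; real rule-based taggers (e.g., NLTK's `RegexpTagger`) use far larger rule sets, and the tag names here are illustrative:

```python
def rule_pos(tokens):
    """Toy rule-based tagger: articles, '-ing' forms, article-preceded nouns."""
    tags = []
    for i, tok in enumerate(tokens):
        if tok.lower() in {"a", "an", "the"}:
            tags.append("DET")
        elif tok.endswith("ing"):
            tags.append("VBG")   # gerund / present participle
        elif i > 0 and tokens[i - 1].lower() in {"a", "an", "the"}:
            tags.append("NOUN")  # word preceded by an article
        else:
            tags.append("UNK")   # no rule fired
    return tags

print(rule_pos("the dog is running".split()))  # ['DET', 'NOUN', 'UNK', 'VBG']
```

The "UNK" fallback illustrates the limitation discussed below: rules only cover the phenomena they were written for.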
基于规则的方法的优点之一是,当规则设计良好并涵盖广泛的语言现象时,它们可以非常准确。它们还可以针对特定领域或文本类型进行定制,例如科学文献或法律文档。
One advantage of rule-based methods is that they can be highly accurate when the rules are well-designed and cover a wide range of linguistic phenomena. They can also be customized to specific domains or genres of text, such as scientific literature or legal documents.
然而,基于规则的方法的一个局限性是它们可能无法捕获自然语言的全部复杂性和可变性,并且随着语言随着时间的推移而发展和变化,可能需要付出巨大的努力来开发和维护规则。他们还可能会遇到歧义,例如一个单词根据上下文可能有多个可能的词性。
However, one limitation of rule-based methods is that they may not be able to capture the full complexity and variability of natural language, and may require significant effort to develop and maintain the rules as language evolves and changes over time. They may also struggle with ambiguity, such as in cases where a word can have multiple possible parts of speech depending on the context.
尽管存在这些限制,基于规则的词性标注方法仍然是 NLP 中的一种重要方法,特别是对于需要高准确率和高精度的应用。
Despite these limitations, rule-based methods for POS tagging remain an important approach in NLP, especially for applications that require high accuracy and precision.
词性标注的统计方法基于使用概率模型自动为句子中的每个单词分配最可能的 POS 标签。这些方法依赖于标记文本的训练语料库(其中 POS 标签已分配给单词)来学习与每个标签关联的特定单词的概率。
Statistical methods for POS tagging are based on using probabilistic models to automatically assign the most likely POS tag to each word in a sentence. These methods rely on a training corpus of tagged text, where the POS tags have already been assigned to the words, to learn the probabilities of a particular word being associated with each tag.
用于词性标注的统计方法主要有两种类型:隐马尔可夫模型(HMM)和条件随机场(CRF)。
Two main types of statistical methods are used for POS tagging: Hidden Markov Models (HMMs) and CRFs.
HMM 是一类概率模型,广泛应用于处理序列数据(包括文本)。在 POS 标签的上下文中,HMM 表示涉及单词序列的 POS 标签序列的概率分布。HMM 假设句子中特定位置处的 POS 标签的可能性仅取决于序列中前面的标签。此外,他们假设特定单词在给定标签的情况下的可能性仍然独立于句子中的其他单词。为了识别给定句子最可能的 POS 标签序列,HMM 采用维特比算法。
HMMs serve as a category of probabilistic models that are extensively applied in handling sequential data, including text. In the context of POS tagging, HMMs represent the probability distribution of a sequence of POS tags concerning a sequence of words. HMMs assume that the likelihood of a POS tag at a specific position within a sentence is contingent solely upon the preceding tag in the sequence. Furthermore, they presume that the likelihood of a particular word, given its tag, remains independent of other words within the sentence. To identify the most probable sequence of POS tags for a given sentence, HMMs employ the Viterbi algorithm.
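The Viterbi decoding step can be sketched for a toy two-tag HMM; every probability below is an illustrative assumption, and log-probabilities are used to avoid numeric underflow:

```python
import numpy as np

tags = ["NOUN", "VERB"]
start = np.log([0.6, 0.4])               # P(first tag)
trans = np.log([[0.3, 0.7],              # P(next tag | NOUN)
                [0.8, 0.2]])             # P(next tag | VERB)
emit = {"dogs": np.log([0.9, 0.1]),      # P(word | tag), toy values
        "run":  np.log([0.2, 0.8])}

def viterbi(words):
    """Return the most probable tag sequence for words under the toy HMM."""
    V = start + emit[words[0]]
    back = []
    for w in words[1:]:
        scores = V[:, None] + trans      # shape (prev_tag, next_tag)
        back.append(scores.argmax(axis=0))
        V = scores.max(axis=0) + emit[w]
    path = [int(V.argmax())]
    for b in reversed(back):             # trace back the best path
        path.append(int(b[path[-1]]))
    return [tags[i] for i in reversed(path)]

print(viterbi(["dogs", "run"]))  # ['NOUN', 'VERB']
```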
CRFs are another type of probabilistic model that is commonly used for sequence labeling tasks, including POS tagging. CRFs differ from HMMs in that they model the conditional probability of the output sequence (that is, the POS tags) given the input sequence (that is, the words), rather than the joint probability of the output and input sequences. This allows CRFs to capture more complex dependencies between the input and output sequences than HMMs. CRFs use an iterative algorithm, such as gradient descent or L-BFGS, to learn the optimal set of weights for the model.
Let’s look at the advantages of statistical methods:
Now, let’s look at the disadvantages:
Deep learning-based methods for POS tagging involve training a neural network model to predict the POS tags for each word in a given sentence. These methods can learn complex patterns and relationships in the text data to accurately tag words with their appropriate parts of speech.
One of the most popular deep learning-based methods for POS tagging is using an RNN with LSTM cells. LSTM-based models can process sequences of words and capture dependencies between them. The input to the model is a sequence of word embeddings, which are vector representations of words in a high-dimensional space. These embeddings are learned during the training process.
The LSTM-based model consists of three main layers: an input layer, an LSTM layer, and an output layer. The structure involves taking word embeddings as input into the input layer. Subsequently, the LSTM layer processes the sequence of these embeddings, aiming to grasp the interdependencies inherent within them. Ultimately, the output layer is responsible for predicting the POS tag for each word within the input sequence. Another popular deep learning-based method for POS tagging is using a transformer-based model, such as Bidirectional Encoder Representations from Transformers (BERT). BERT is a language model that comes pre-trained and employs a transformer-based architecture to acquire a profound understanding of contextual relationships among words within a sentence. It undergoes training with vast quantities of text data and can be fine-tuned to excel in diverse NLP tasks, one of which is POS tagging.
To use BERT for POS tagging, the input sentence must be tokenized, and each token must be assigned an initial POS tag. The token embeddings are then fed into the pre-trained BERT model, which outputs contextualized embeddings for each token. These embeddings are passed through a feedforward neural network to predict the final POS tag for each token.
Deep learning approaches for POS tagging have demonstrated leading-edge performance across numerous benchmark datasets. Nonetheless, their effectiveness demands substantial training data and computational resources, and the training process can be time-consuming. Moreover, they may suffer from a lack of interpretability, which makes it difficult to understand how the model is making its predictions.
Several libraries are available for performing POS tagging in various programming languages, including Python, Java, and C++. Some popular NLP libraries that provide POS tagging functionality include NLTK, spaCy, Stanford CoreNLP, and Apache OpenNLP.
Here is an example of POS tagging using the NLTK library in Python:
import nltk
input_sentence = "The young white cat jumps over the lazy dog"
processed_tokens = nltk.word_tokenize(input_sentence)
tags = nltk.pos_tag(processed_tokens)
print(tags)
The output is as follows:
[('The', 'DT'), ('young', 'JJ'), ('white', 'NN'), ('cat', 'NN'), ('jumps', 'VBZ'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'JJ'), ('dog', 'NN')]
In this example, the nltk.pos_tag() function is used to tag the words in the sentence. The function returns a list of tuples where each tuple contains a word and its POS tag. The POS tags that have been used here are based on the Penn Treebank tagset.
A regular expression is a type of text pattern that has various applications in modern programming languages and software. Regular expressions are useful for validating whether an input conforms to a particular text pattern, locating text within a larger body that matches the pattern, replacing matched text with alternative text or rearranging parts of it, and dividing a block of text into a list of subtexts. However, they can cause unintended consequences if used incorrectly.
In computer science, the term regular expression derives from the concept of regular languages in formal language theory: regular expressions describe exactly the class of patterns that regular languages can express.
A regular expression, often referred to as regex or regexp, is a series of characters that constitutes a search pattern. Regular expressions are used to match and manipulate text, typically in the context of text processing, search algorithms, and NLP.
A regular expression comprises a mix of characters and metacharacters, which collectively establish a pattern to search for within a text string. The simplest form of a regular expression is a mere sequence of characters that must be matched precisely. For example, the regular expression “hello” would match any string that contains the characters “hello” in sequence.
Metacharacters are special characters within regular expressions that possess pre-defined meanings. For instance, the “.” (dot) metacharacter matches any individual character, whereas the “*” (asterisk) metacharacter matches zero or more instances of the preceding character or group. Regular expressions can be used for a wide range of text-processing tasks. Let’s take a closer look.
Regular expressions can be used to validate input by matching it against a pattern. For example, you can use a regular expression to validate an email address or a phone number.
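As a short sketch of pattern-based validation, the patterns below are deliberately simplified illustrations, not fully standards-compliant validators (real-world email validation in particular is far more involved), and the US-style phone format is an assumption:

```python
import re

# Simplified validation patterns (illustrative, not RFC-compliant):
email_pattern = re.compile(r'^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$')
phone_pattern = re.compile(r'^\d{3}-\d{3}-\d{4}$')  # assumed US-style format

def is_valid_email(text):
    # match() anchors at the start; the trailing $ anchors at the end
    return bool(email_pattern.match(text))

def is_valid_phone(text):
    return bool(phone_pattern.match(text))

print(is_valid_email("john@example.com"))  # True
print(is_valid_email("not-an-email"))      # False
print(is_valid_phone("555-123-4567"))      # True
```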
Text manipulation using regular expressions involves using pattern-matching techniques to find and manipulate text strings in a document or dataset. Regular expressions are powerful tools for working with text data, allowing for complex search and replace operations, text extraction, and formatting.
Some common text manipulation tasks that can be accomplished with regular expressions are as follows:
Here are the general steps for using regular expressions for data extraction:
Here’s an example of how to extract all email addresses from a string using regular expressions in Python:
import re
text = "John's email is john@example.com and Jane's email is jane@example.com"
# Pattern for email addresses:
pattern = r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b'
regex = re.compile(pattern)
# Search for all occurrences of the pattern in the text:
matches = regex.findall(text)
print(matches)
['john@example.com', 'jane@example.com']
Next, we’ll cover text cleaning.
Text cleaning means using regular expressions to clean and standardize text data, thereby removing unwanted characters, whitespace, or other formatting.
Here are some common text-cleaning techniques that use regular expressions:
By using regular expressions for text cleaning, you can remove noise and irrelevant information from text, making it easier to analyze and extract meaningful insights.
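A small cleaning sketch combining several such techniques follows; the specific cleaning choices (stripping HTML-like tags, dropping punctuation, collapsing whitespace, lowercasing) are illustrative assumptions, since real pipelines pick steps based on the downstream task:

```python
import re

def clean_text(text):
    text = re.sub(r'<[^>]+>', ' ', text)        # remove HTML-like tags
    text = re.sub(r'[^A-Za-z0-9\s]', '', text)  # remove punctuation/symbols
    text = re.sub(r'\s+', ' ', text)            # collapse runs of whitespace
    return text.strip().lower()

raw = "<p>Hello,   World!!  Visit us <b>today</b>.</p>"
print(clean_text(raw))  # hello world visit us today
```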
Parsing involves analyzing a text string to discern its grammatical structure according to a specified grammar. Regular expressions serve as potent instruments for text parsing, especially when dealing with uncomplicated and regular grammatical patterns.
To parse text using regular expressions, you need to define a grammar for the language you want to parse. The grammar should specify the possible components of a sentence, such as nouns, verbs, adjectives, and so on, as well as the rules that dictate how these components can be combined to form valid sentences.
Once you have defined the grammar, you can use regular expressions to identify the individual components of a sentence and the relationships between them. For example, you can use regular expressions to match all the nouns in a sentence or to identify the subject and object of a verb.
One common approach to parsing with regular expressions is to define a set of patterns that correspond to the different parts of speech and sentence structures in your grammar. For example, you might define a pattern for matching nouns, a pattern for matching verbs, and a pattern for matching sentences that consist of a subject followed by a verb and an object.
To use these patterns for parsing, you would apply them to a text string using a regular expression engine, which would match the patterns to the appropriate parts of the string. The output of the parsing process would be a parse tree or other data structure that represents the grammatical structure of the sentence.
One limitation of regular expression parsing is that it is generally not suitable for handling more complex or ambiguous grammar. For example, it can be difficult to handle cases where a word could be either a noun or a verb depending on the context, or where the structure of a sentence is ambiguous.
We can also use regular expressions to break a larger text document into smaller chunks or tokens based on specific patterns or delimiters.
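For instance, `re.split()` can break text on a pattern of delimiters; the delimiters and the rough sentence-splitting heuristic below are illustrative choices:

```python
import re

text = "apples, oranges; bananas: grapes"

# Split on any of comma, semicolon, or colon, with optional trailing spaces:
tokens = re.split(r'[,;:]\s*', text)
print(tokens)  # ['apples', 'oranges', 'bananas', 'grapes']

# Split a paragraph into rough sentences: look behind for ., !, or ?
# followed by whitespace (a heuristic that real tokenizers refine):
paragraph = "First sentence. Second one! A third?"
sentences = re.split(r'(?<=[.!?])\s+', paragraph)
print(sentences)  # ['First sentence.', 'Second one!', 'A third?']
```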
To use regular expressions for text manipulation, you typically need to define a pattern that matches the text you want to find or manipulate. This pattern can include special characters and syntax to define the specific sequence of characters, numbers, or other elements that make up the text string.
For example, the regular expression pattern \d{3}-\d{2}-\d{4} might be used to search for and extract Social Security numbers in a larger text document. This pattern matches a sequence of three digits, followed by a dash, then two more digits, another dash, and four final digits followed by a non-digit, which together represent the standard format for a Social Security number in the USA.
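Applied in Python, with \b word boundaries added so the pattern cannot match inside a longer run of digits (the sample document text is invented for illustration):

```python
import re

# SSN-shaped pattern from the text, anchored with word boundaries:
pattern = re.compile(r'\b\d{3}-\d{2}-\d{4}\b')

document = "Employee A: 123-45-6789. Phone: 555-123-4567. Employee B: 987-65-4321."
print(pattern.findall(document))  # ['123-45-6789', '987-65-4321']
```

Note that the phone number is not matched, because its digit grouping (3-3-4) does not fit the 3-2-4 pattern.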
Once you have defined your regular expression pattern, you can use it with various text manipulation tools and programming languages, such as grep, sed, awk, Perl, Python, and many others, to perform complex text manipulation tasks.
Some programming languages, such as Perl and Python, have built-in support for regular expressions. Other programming languages, such as Java and C++, require you to use a library or API to work with regular expressions.
While regular expressions are powerful tools for text processing, they can also be complex and difficult to understand. It’s important to be familiar with the syntax and behavior of regular expressions to use them effectively in your code.
Tokenization is a process in NLP that involves breaking down a piece of text or a sentence into individual words or terms, known as tokens. The tokenization process can be applied to various forms of data, such as textual documents, social media posts, web pages, and more.
The tokenization process is an important initial step in many NLP tasks as it transforms unstructured text data into a structured format that can be analyzed using machine learning algorithms or other techniques. These tokens can be used to perform various operations in the text, such as counting word frequencies, identifying the most common phrases, and so on.
There are different methods of tokenization:
The nimble white cat jumps over the sleepy dog
This can be tokenized into the following list of words:
[“The”, “nimble”, “white”, “cat”, “jumps”, “over”, “the”, “sleepy”, “dog”]
This is the first sentence.
This is the second sentence.
This is the third sentence.
This can be tokenized into the following list of sentences:
[“This is the first sentence.”,
“This is the second sentence.”,
“This is the third sentence.”]
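Both forms of tokenization above can be sketched with regular expressions; this is a minimal illustration, and real libraries such as NLTK handle many more edge cases (abbreviations, quotes, hyphenation, and so on):

```python
import re

# Word tokenization: pull out runs of word characters
text = "The nimble white cat jumps over the sleepy dog"
words = re.findall(r'\w+', text)
print(words)
# ['The', 'nimble', 'white', 'cat', 'jumps', 'over', 'the', 'sleepy', 'dog']

# Sentence tokenization: split after ., !, or ? followed by whitespace
paragraph = ("This is the first sentence. This is the second sentence. "
             "This is the third sentence.")
sentences = re.split(r'(?<=[.!?])\s+', paragraph)
print(sentences)
# ['This is the first sentence.', 'This is the second sentence.',
#  'This is the third sentence.']
```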
Tokenization is an important step in NLP and is used in many applications, such as sentiment analysis, document classification, machine translation, and more.
Tokenization is also an important step in language models. For example, BERT, a well-known language model, uses a sub-word tokenizer called WordPiece, which divides words into either their full forms or smaller components known as word pieces, so a single word can be represented by several tokens. WordPiece tokenization is a data-driven approach that builds a large vocabulary of sub-words based on the corpus of text being trained on. These sub-word units are represented as embeddings that are used as input to the BERT model.
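The core idea can be sketched as a greedy longest-match sub-word tokenizer in the spirit of WordPiece. The tiny vocabulary below is an invented assumption (BERT's real vocabulary has roughly 30,000 entries learned from data), and this simplified sketch omits many details of the real algorithm:

```python
# Continuation pieces carry the "##" prefix, as in BERT's vocabulary files.
vocab = {"play", "##ing", "##ed", "un", "##play", "jump", "##s", "[UNK]"}

def wordpiece_tokenize(word):
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        cur = None
        # Greedily take the longest piece in the vocabulary
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # mark word-internal pieces
            if piece in vocab:
                cur = piece
                break
            end -= 1
        if cur is None:
            return ["[UNK]"]  # no piece matched: out-of-vocabulary
        tokens.append(cur)
        start = end
    return tokens

print(wordpiece_tokenize("playing"))  # ['play', '##ing']
print(wordpiece_tokenize("jumps"))    # ['jump', '##s']
```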
One of the key features of the BERT tokenizer is that it can handle out-of-vocabulary (OOV) words. If the tokenizer encounters a word that is not in its vocabulary, it will break the word down into sub-words and represent the word as a combination of its sub-word embeddings. We will explain BERT and its tokenizer in more detail later in this book. The benefit of using a tokenizer in language models is that we can limit the number of inputs to the size of our dictionary rather than all possible inputs. For example, BERT has a 30,000-word vocabulary size, which helps us limit the size of the deep learning language model. Using a bigger tokenizer will increase the size of the model. In the next section, we will explain how to use the methods that were covered in this chapter in a complete preprocessing pipeline.
We will explain a complete preprocessing pipeline that has been provided by the authors to you, the reader.
As shown in the following code, the input is a formatted text with encoded tags, similar to what we can extract from HTML web pages:
"<SUBJECT LINE> Employees details<END><BODY TEXT>Attached are 2 files,\n1st one is pairoll, 2nd is healtcare!<END>" 我们来看看将每一步应用到文本上的效果:
Let’s take a look at the effect of applying each step to the text:
Employees details. Attached are 2 files, 1st one is pairoll, 2nd is healtcare!
employees details. attached are 2 files, 1st one is pairoll, 2nd is healtcare!
employees details. attached are two files, first one is pairoll, second is healtcare!
employees details attached are two files first one is pairoll second is healtcare
employees details attached are two files first one is payroll second is healthcare
employees details attached two files first one payroll second healthcare
employe detail attach two file first one payrol second healthcar
employe detail attach two file first one payrol second healthcar
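The steps above can be sketched as a single pipeline function. The tiny number-word, spell-correction, and stopword tables and the crude suffix-stripping "stemmer" below are illustrative stand-ins for real components (for example, a spell-checker library, NLTK's stopword list, and a Porter stemmer), so the final output differs slightly from the stemmed output shown above:

```python
import re

# Illustrative lookup tables (real pipelines use full-fledged components):
NUMBER_WORDS = {"1st": "first", "2nd": "second", "2": "two"}
SPELL_FIXES = {"pairoll": "payroll", "healtcare": "healthcare"}
STOPWORDS = {"are", "is", "the", "a", "an"}

def preprocess(text):
    text = re.sub(r'<[^>]+>', ' ', text)               # strip markup tags
    text = text.lower()                                # lowercase
    tokens = re.findall(r"[\w']+", text)               # tokenize
    tokens = [NUMBER_WORDS.get(t, t) for t in tokens]  # numbers to words
    tokens = [SPELL_FIXES.get(t, t) for t in tokens]   # spelling correction
    tokens = [t for t in tokens if t not in STOPWORDS] # stopword removal
    tokens = [re.sub(r'(ing|s)$', '', t) for t in tokens]  # naive stemming
    return " ".join(tokens)

raw = ("<SUBJECT LINE> Employees details<END><BODY TEXT>Attached are 2 "
       "files,\n1st one is pairoll, 2nd is healtcare!<END>")
print(preprocess(raw))
# employee detail attached two file first one payroll second healthcare
```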
With that, we’ve learned about different preprocessing methods. Next, we’ll review a piece of code for performing NER and POS.
For this example, we used the spaCy library for Python to perform these tasks. Here our input is:
The companies that would be releasing their quarterly reports tomorrow are Microsoft, 4pm, Google, 4pm, and AT&T, 6pm.
Here’s the output for NER:
The companies that would be releasing their quarterly DATE reports tomorrow DATE are Microsoft ORG, 4pm TIME, Google ORG, 4pm TIME, and AT&T ORG, 6pm TIME.
As you can see, using NER, we were able to detect parts of the sentence that are related to company names (ORG) or dates.
Figure 4.1 shows an example of performing POS tagging:
Figure 4.1 – POS tagging using spaCy
Here’s the output:
[['companies', 'NOUN'],
['releasing', 'VERB'],
['quarterly', 'ADJ'],
['reports', 'NOUN'],
['tomorrow', 'NOUN'],
['Microsoft', 'PROPN'],
['pm', 'NOUN'],
['Google', 'PROPN'],
['pm', 'NOUN'],
['AT&T', 'PROPN'],
['pm', 'NOUN']]
The preceding code examples illustrate the various aspects of preprocessing, which takes raw text and transforms it into a form that suits the downstream model and, in turn, the purpose of the overall design.
In this chapter, we covered a range of techniques and methods for text preprocessing, including normalization, tokenization, stop word removal, POS tagging, and more. We explored different approaches to these techniques, such as rule-based methods, statistical methods, and deep learning-based methods. We also discussed the advantages and disadvantages of each method and provided examples and code snippets to illustrate their use.
At this point, you should have a solid understanding of the importance of text preprocessing and the various techniques and methods available for cleaning and preparing text data for analysis. You should be able to implement these techniques using popular libraries and frameworks in Python and understand the trade-offs between different approaches. Furthermore, you should have a better understanding of how to process text data to achieve better results in NLP tasks such as sentiment analysis, topic modeling, and text classification.
In the next chapter, we will explain text classification, and different methods for performing this task.
In this chapter, we’ll delve into the fascinating world of text classification, a foundational task in natural language processing (NLP) and machine learning (ML) that deals with categorizing text documents into predefined classes. As the volume of digital text data continues to grow exponentially, the ability to accurately and efficiently classify text has become increasingly important for a wide range of applications, such as sentiment analysis, spam detection, and document organization. This chapter provides a comprehensive overview of the key concepts, methodologies, and techniques that are employed in text classification, catering to readers from diverse backgrounds and skill levels.
We’ll begin by exploring the various types of text classification tasks and their unique characteristics, offering insights into the challenges and opportunities each type presents. Next, we’ll introduce the concept of N-grams and discuss how they can be utilized as features for text classification, capturing not only individual words but also the local context and word sequences within the text. We’ll then examine the widely used term frequency-inverse document frequency (TF-IDF) method, which assigns weights to words based on their frequency in a document and across the entire corpus, showcasing its effectiveness in distinguishing relevant words for classification tasks.
Following that, we’ll delve into the powerful Word2Vec algorithm and its application in text classification. We’ll discuss how Word2Vec creates dense vector representations of words that capture semantic meaning and relationships, and how these embeddings can be used as features to improve classification performance. Furthermore, we’ll cover popular architectures such as continuous bag-of-words (CBOW) and Skip-Gram, providing a deeper understanding of their inner workings.
Lastly, we’ll explore the concept of topic modeling, a technique for discovering hidden thematic structures within a collection of documents. We’ll examine popular algorithms such as latent Dirichlet allocation (LDA) and describe how topic modeling can be applied to text classification, enabling the discovery of semantic relationships between documents and improving classification performance.
Throughout this chapter, we aim to provide a thorough understanding of the underlying concepts and techniques that are employed in text classification, equipping you with the knowledge and skills needed to successfully tackle real-world text classification problems.
The following topics will be covered in this chapter:
To effectively read and understand this chapter, it is essential to have a solid foundation in various technical areas. A strong grasp of fundamental concepts in NLP, ML, and linear algebra is crucial. Familiarity with text preprocessing techniques, such as tokenization, stop word removal, and stemming or lemmatization, is necessary to comprehend the data preparation stage.
Additionally, understanding basic ML algorithms, such as logistic regression and support vector machines (SVMs), is crucial for implementing text classification models. Finally, being comfortable with evaluation metrics such as accuracy, precision, recall, and F1 score, along with concepts such as overfitting, underfitting, and hyperparameter tuning, will enable a deeper appreciation of the challenges and best practices in text classification.
Text classification is an NLP task where ML algorithms assign predefined categories or labels to text based on its content. It involves training a model on a labeled dataset to enable it to accurately predict the category of unseen or new text inputs. Text classification methods can be categorized into three main types – supervised learning, unsupervised learning, and semi-supervised learning:
Each of these text classification types has its strengths and weaknesses and is suitable for different types of applications. Understanding these types can help in choosing the appropriate approach for a given problem. In the following subsections, we’ll explain each of these methods in detail.
Supervised learning is a type of ML where an algorithm learns from labeled data to predict the label of new, unseen data.
In the context of text classification, supervised learning involves training a model on a labeled dataset, where each document or text sample is labeled with the corresponding category or class. The model then uses this training data to learn patterns and relationships between the text features and their associated labels:
A labeled dataset is assumed to possess the highest level of reliability. Often, it is derived by having subject matter experts manually review the text and assign the appropriate class to each item. In other scenarios, there may be automated methods for deriving the labels. For instance, in cybersecurity, you may collect historical data and assign each item a label based on the outcome that followed it – that is, whether the action turned out to be legitimate or not. Since such historical data exists in most domains, it too can serve as a reliable labeled set.
Once the model has been trained, it can be used to predict the category or class of new, unseen text data based on the learned patterns and relationships between the text features and their associated labels.
Supervised learning algorithms are commonly used for text classification tasks. Let’s look at some common supervised learning algorithms that are used for text classification.
Naive Bayes is a probabilistic algorithm that is commonly used for text classification. It is based on Bayes’ theorem, which states that the probability of a hypothesis (in this case, a document belonging to a particular class), given some observed evidence (in this case, the words in the document), is proportional to the probability of the evidence given the hypothesis times the prior probability of the hypothesis. Naive Bayes assumes that the features (words) are independent of each other given the class label, which is where the “naive” part of the name comes from.
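A compact multinomial Naive Bayes classifier can be written from scratch to make the idea concrete. The tiny training set below is invented for illustration; in practice you would use a library such as scikit-learn and far more data:

```python
import math
from collections import Counter, defaultdict

train = [
    ("win money now", "spam"),
    ("cheap money offer", "spam"),
    ("meeting schedule today", "ham"),
    ("project meeting notes", "ham"),
]

class_counts = Counter(label for _, label in train)
word_counts = defaultdict(Counter)
vocab = set()
for text, label in train:
    for word in text.split():
        word_counts[label][word] += 1
        vocab.add(word)

def predict(text):
    best_label, best_score = None, float("-inf")
    for label in class_counts:
        # log prior plus log likelihoods with add-one (Laplace) smoothing
        score = math.log(class_counts[label] / len(train))
        total = sum(word_counts[label].values())
        for word in text.split():
            score += math.log((word_counts[label][word] + 1) /
                              (total + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

print(predict("cheap money today"))  # spam
print(predict("project meeting"))    # ham
```

The add-one smoothing term is what keeps an unseen word from zeroing out an entire class probability.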
Logistic regression is a statistical method that is used for binary classification problems (that is, problems where there are only two possible classes). It models the probability of the document belonging to a particular class using a logistic function, which maps any real-valued input to a value between 0 and 1.
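The logistic function itself is a one-liner; the example scores passed to it below are illustrative stand-ins for a weighted sum of document features:

```python
import math

def sigmoid(z):
    # Maps any real-valued score z to a probability in (0, 1)
    return 1 / (1 + math.exp(-z))

print(sigmoid(0))               # 0.5 -- maximally uncertain
print(round(sigmoid(2.5), 3))   # 0.924 -- strongly toward class 1
print(round(sigmoid(-2.5), 3))  # 0.076 -- strongly toward class 0
```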
SVM is a powerful classification algorithm that is used in a variety of applications, including text classification. SVM works by finding the hyperplane that best separates the data into different classes. In text classification, the features are typically the words in the document, and the hyperplane is used to divide the space of all possible documents into different regions corresponding to different classes.
All of these algorithms can be trained using labeled data, where the class labels are known for each document in the training set. Once trained, the model can be used to predict the class label of new, unlabeled documents. The performance of the model is typically evaluated using metrics such as accuracy, precision, recall, and F1 score.
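These metrics can all be computed from the four confusion-matrix counts; the binary label vectors below are invented for illustration:

```python
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

# Confusion-matrix counts for the positive class (1):
tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)

accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)                       # of predicted positives, how many were right
recall = tp / (tp + fn)                          # of actual positives, how many were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean

print(accuracy, precision, recall, f1)  # 0.75 0.75 0.75 0.75
```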
Unsupervised learning is a type of ML where the data is not labeled and the algorithm is left to find patterns and structures on its own. In the context of text classification, unsupervised learning methods can be used when there is no labeled data available or when the goal is to discover hidden patterns in the text data.
One common unsupervised learning method for text classification is clustering. Clustering algorithms group similar documents together based on their content, without any prior knowledge of what each document is about. Clustering can be used to identify topics in a collection of documents or to group similar documents together for further analysis.
Another popular unsupervised learning algorithm for text classification is LDA. LDA is a probabilistic generative model that assumes that each document in a corpus is a mixture of topics, and each topic is a probability distribution over words. LDA can be used to discover the underlying topics in a collection of documents, even when the topics are not explicitly labeled.
Finally, word embeddings are a popular unsupervised learning technique used for text classification. Word embeddings are dense vector representations of words that capture their semantic meaning based on the context in which they appear. They can be used to identify similar words and to find relationships between words, which can be useful for tasks such as text similarity and recommendation systems. Common word embedding models include Word2Vec and GloVe.
Word2Vec is a popular algorithm that’s used to generate word embeddings, which are vector representations of words in a high-dimensional space. The algorithm was developed by a team of researchers at Google, led by Tomas Mikolov, in 2013. The main idea behind Word2Vec is that words that appear in similar contexts tend to have similar meanings.
The algorithm takes in a large corpus of text as input and generates a vector representation for each word in the vocabulary. The vectors are typically high-dimensional (for example, 100 or 300 dimensions) and can be used to perform various NLP tasks, such as sentiment analysis, text classification, and machine translation.
Two main architectures are used in Word2Vec: CBOW and skip-gram. In the CBOW architecture, the algorithm tries to predict the target word given a window of context words. In the skip-gram architecture, the algorithm tries to predict the context words given a target word. The training objective is to maximize the likelihood of the target word or context words given the input.
Word2Vec has been widely adopted in the NLP community and has shown state-of-the-art performance on various benchmarks. It has also been used in many real-world applications, such as recommender systems, search engines, and chatbots.
Semi-supervised learning is an ML paradigm that sits between supervised and unsupervised learning. It utilizes a combination of labeled and unlabeled data for training, which is especially useful when labeled data is expensive or time-consuming to obtain. This approach allows the model to leverage the information in the unlabeled data to improve its performance on the classification task.
In the context of text classification, semi-supervised learning can be beneficial when we have a limited number of labeled documents but a large corpus of unlabeled documents. The goal is to improve the performance of the classifier by leveraging the information contained in the unlabeled data.
There are several common semi-supervised learning algorithms, including label propagation and co-training. We’ll discuss each of these in more detail next.
Label propagation is a graph-based semi-supervised learning algorithm. It builds a graph using both labeled and unlabeled data points, with each data point represented as a node and edges representing the similarity between nodes. The algorithm works by propagating the labels from the labeled nodes to the unlabeled nodes based on their similarity.
The key idea is that similar data points should have similar labels. The algorithm begins by assigning initial label probabilities to the unlabeled nodes, typically based on their similarity to labeled nodes. Then, an iterative process propagates these probabilities throughout the graph until convergence. The final label probabilities are used to classify the unlabeled data points.
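A minimal sketch of this iterative process with scikit-learn's `LabelPropagation`, where unlabeled points are marked `-1` as the API expects (the six short "documents" are invented for illustration):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.semi_supervised import LabelPropagation

docs = ["heart attack risk", "heart disease treatment",
        "brain tumor surgery", "brain nerve disorder",
        "heart valve repair", "brain scan imaging"]
# Only the first and third documents are labeled
# (0 = cardiology, 1 = neurology); -1 marks unlabeled points.
y = [0, -1, 1, -1, -1, -1]

X = TfidfVectorizer().fit_transform(docs).toarray()  # dense features
model = LabelPropagation(kernel="rbf", gamma=5).fit(X, y)
print(model.transduction_)  # inferred labels for every document
```

After fitting, `transduction_` holds the propagated label for every node in the graph, including the originally unlabeled ones.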
Co-training is another semi-supervised learning technique that trains multiple classifiers on different views of the data. A view is a subset of features that are sufficient for the learning task and are conditionally independent given the class label. The basic idea is to use one classifier’s predictions to label some of the unlabeled data, and then use that newly labeled data to train the other classifier. This process is performed iteratively, with each classifier improving the other until a stopping criterion is met.
To apply semi-supervised learning in a specific domain, let’s consider a medical domain where we want to classify scientific articles into different categories such as cardiology, neurology, and oncology. Suppose we have a small set of labeled articles and a large set of unlabeled articles.
A possible approach could be to use label propagation by creating a graph of articles where the nodes represent the articles and the edges represent the similarity between the articles. The similarity could be based on various factors, such as the words used, the topics covered, or the citation networks between the articles. After propagating the labels, we can classify the unlabeled articles based on the final label probabilities.
Alternatively, we could use co-training by splitting the features into two views, such as the abstract and the full text of the articles. We would train two classifiers, one for each view, and iteratively update the classifiers using the predictions made by the other classifier on the unlabeled data.
In both cases, the goal is to leverage the information in the unlabeled data to improve the performance of the classifier in the specific domain.
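scikit-learn does not ship a co-training implementation, but its `SelfTrainingClassifier` illustrates the closely related pseudo-labeling loop described above: a base classifier labels the unlabeled documents it is confident about, then retrains on the enlarged set. The articles below are invented stand-ins for the medical example:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.semi_supervised import SelfTrainingClassifier

docs = ["heart valve repair", "cardiac bypass surgery",
        "brain lesion imaging", "nerve damage treatment",
        "coronary heart disease", "brain nerve stimulation"]
y = [0, 0, 1, 1, -1, -1]  # -1 marks unlabeled articles

model = make_pipeline(
    TfidfVectorizer(),
    SelfTrainingClassifier(LogisticRegression(), threshold=0.5),
)
model.fit(docs, y)
pred = model.predict(["heart artery scan"])
print(pred)
```

Lowering `threshold` makes the loop pseudo-label more aggressively; a higher value keeps only high-confidence pseudo-labels, at the cost of using less of the unlabeled data.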
In this chapter, we’ll elaborate on supervised text classification and topic modeling.
One-hot encoded vector representation is a method of representing categorical data, such as words, as binary vectors. In the context of text classification, one-hot encoding can be used to represent text data as numerical input features for a classification model. Here’s a detailed explanation of text classification using one-hot encoding vectors.
The first step is to preprocess the text data, as explained in the previous chapter. The main goal of preprocessing is to transform raw text into a more structured and consistent format that can be easily understood and processed by ML algorithms. Here are several reasons why text preprocessing is essential for one-hot encoded vector classification:
Once we preprocess the text, we can start extracting the words in the text. We call this task vocabulary construction.
Construct a vocabulary containing all unique words in the preprocessed text. Assign a unique index to each word in the vocabulary.
Vocabulary construction is an essential step in preparing text data for one-hot encoded vector classification. The vocabulary is a set of all unique words (tokens) in the preprocessed text data. It serves as a basis for creating one-hot-encoded feature vectors for each document. Here’s a detailed explanation of the vocabulary construction process for one-hot encoded vector classification:
For example, consider that the following preprocessed dataset consists of two documents:
The vocabulary for this dataset would be {“apple”, “banana”, “orange”, “grape”}.
Using the preceding example, you might assign the following indices:
With the constructed vocabulary and assigned indices, you can now create one-hot encoded vectors for each document in the dataset. One simple approach to creating a one-hot encoded vector is to use bag-of-words. For each word in a document, find its corresponding index in the vocabulary and set the value at that index to 1 in the one-hot-encoded vector. If a word appears multiple times in the document, its corresponding value in the one-hot-encoded vector remains 1. All other values in the vector will be 0.
For example, using the vocabulary and indices mentioned previously, the one-hot encoded vectors for the documents would be as follows:
Once we have the corresponding values for each document, we can create a feature matrix with one-hot-encoded vectors as rows, where each row represents a document and each column represents a word from the vocabulary. This matrix will be used as input for the text classification model. For example, in the previous example, the feature vectors for two documents are as follows:
|            | Apple | Banana | Orange | Grape |
| Document 1 | 1     | 1      | 1      | 0     |
| Document 2 | 1     | 1      | 0      | 1     |
Table 5.1 – Sample one-hot-encoded vector for two documents
Note that text preprocessing helps keep the vocabulary small, which in turn gives us better model performance. Besides that, if needed, we can apply feature selection methods (as explained previously in this book) to the extracted feature vectors to improve model performance further.
While creating a one-hot encoded vector from words is useful, sometimes, we need to consider the existence of two words beside each other. For example, “very good” and “not good” can have different meanings. To achieve this goal, we can use N-grams.
N-grams are a generalization of the bag-of-words model that takes into account the order of words by considering sequences of n consecutive words. An N-gram is a contiguous sequence of n items (typically words) from a given text. For example, in the sentence “The cat is on the mat,” the 2-grams (bigrams) would be “The cat,” “cat is,” “is on,” “on the,” and “the mat.”
Using N-grams can help capture local context and word relationships, which may improve the performance of the classifier. However, it also increases the dimensionality of the feature space, which can be computationally expensive.
Train an ML model, such as logistic regression, SVM, or neural networks, on the feature matrix to learn the relationship between the one-hot encoded text features and the target labels. The model will learn to predict the class label based on the presence or absence of specific words in the document. Once we’ve decided on the training process, we need to perform the following tasks:
One potential limitation of using one-hot encoded vectors for text classification is that they do not capture word order, context, or semantic relationships between words. This can lead to suboptimal performance, especially in more complex classification tasks. More advanced techniques, such as word embeddings (for example, Word2Vec or GloVe) or deep learning models (for example, CNNs or RNNs), can provide better representations for text data in these cases.
In summary, text classification using one-hot-encoded vectors involves preprocessing text data, constructing a vocabulary, representing text data as one-hot encoded feature vectors, training an ML model on the feature vectors, and evaluating and applying the model to new text data. The one-hot encoded vector representation is a simple but sometimes limited approach to text classification, and more advanced techniques may be necessary for complex tasks.
So far, we’ve learned about classifying documents using N-grams. However, this approach has a drawback. There are a considerable number of words that occur in the documents frequently and do not add value to our models. To improve the models, text classification using TF-IDF has been proposed.
One-hot encoded vectors are a good approach to classification. However, one of their weaknesses is that they do not weight words by how important they are to a particular document relative to the rest of the collection. To solve this issue, using TF-IDF can be helpful.
TF-IDF is a numerical statistic that is used to measure the importance of a word in a document within a document collection. It helps reflect the relevance of words in a document, considering not only their frequency within the document but also their rarity across the entire document collection. The TF-IDF value of a word increases proportionally to its frequency in a document but is offset by the frequency of the word in the entire document collection.
Here’s a detailed explanation of the mathematical equations involved in calculating TF-IDF:
The TF measures the importance of a word within a specific document. A common definition is the relative frequency: TF(t, d) = (number of times word 't' appears in document d) / (total number of words in d).
IDF(t) = log((Total number of documents in the collection) / (Number of documents containing word 't'))
The logarithm is used to dampen the effect of the IDF component. If a word appears in many documents, its IDF value will be closer to 0, and if it appears in fewer documents, its IDF value will be higher.
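The TF and IDF formulas above can be implemented directly. Below is a minimal pure-Python sketch; the tokenized documents are invented for illustration, and the natural logarithm is assumed (the base only rescales the values):

```python
import math

# Toy pre-tokenized corpus (stemmed tokens, as in the movie-review example)
docs = [["love", "movi", "great"],
        ["bore", "movi", "terribl"],
        ["great", "stori", "amaz"]]
N = len(docs)

def tf(term, doc):
    # TF: relative frequency of the term within the document
    return doc.count(term) / len(doc)

def idf(term):
    # IDF: log of (total documents / documents containing the term)
    df = sum(term in doc for doc in docs)
    return math.log(N / df)

def tf_idf(term, doc):
    return tf(term, doc) * idf(term)

print(tf_idf("great", docs[0]))  # "great" appears in 2 of the 3 documents
print(idf("movi"))               # appears in all but one document
```

Note that "movi" in two of three documents gets a low IDF, while a word unique to one document gets the highest IDF, matching the dampening behavior described above.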
The resulting TF-IDF value represents the importance of a word in a document, taking into account both its frequency within the document and its rarity across the entire document collection. High TF-IDF values indicate words that are more significant in a particular document, whereas low TF-IDF values indicate words that are either common across all documents or rare within the specific document.
Let’s consider a simple example of classifying movie reviews into two categories: positive and negative. We have a small dataset with three movie reviews and their respective labels, as follows:
Now, we will use TF-IDF to classify a new, unseen movie review:
Here are the steps that we need to perform to have the classifier predict the class of our document:
Vocabulary: {“love”, “movi”, “act”, “great”, “stori”, “captiv”, “bore”, “not”, “like”, “terribl”, “amaz”, “wonder”, “brilliant”, “interest”, “good”}
For example, for the word “stori” in Document 4, we have the following:
4. Step 4 – compute the TF-IDF values: Calculate the TF-IDF values for each word in each document.
Repeat this process for all words in all documents and create a feature matrix with the TF-IDF values.
5. Step 5 – train a classifier: Split the dataset into a training set (documents 1 to 3) and a test set (document 4). Train a classifier, such as logistic regression or SVM, using the training set’s TF-IDF feature matrix and their corresponding labels (positive or negative).
6. Step 6 – predict the class label: Preprocess and compute the TF-IDF values for the new movie review (document 4) using the same vocabulary. Use the trained classifier to predict the class label for document 4 based on its TF-IDF feature vector.
For example, if the classifier predicts a positive label for document 4, the classification result would be as follows:
By following these steps, you can use the TF-IDF representation to classify text documents based on the importance of words in the documents relative to the entire document collection.
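These steps can be sketched end to end with scikit-learn. Two caveats: `TfidfVectorizer` uses a smoothed variant of the IDF formula given earlier, and the three training reviews below are invented stand-ins for documents 1 to 3:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

train_docs = ["I loved the movie, the story was captivating",
              "Boring movie, I did not like it, terrible acting",
              "Amazing wonderful film with brilliant acting"]
train_labels = ["positive", "negative", "positive"]

# The pipeline vectorizes (steps 1-4) and classifies (step 5) in one object
clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(train_docs, train_labels)

# Step 6: the same fitted vocabulary is applied to the unseen review
pred = clf.predict(["What a great and interesting story"])
print(pred)
```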
In summary, the TF-IDF value is calculated using the mathematical equations for TF and IDF. It serves as a measure of the importance of a word in a document relative to the entire document collection, considering both the frequency of the word within the document and its rarity across all documents.
One of the methods to perform text classification is to convert the words into embedding vectors so that you can use those vectors for classification. Word2Vec is a well-known method to perform this task.
Word2Vec is a group of neural network-based models that are used to create word embeddings, which are dense vector representations of words in a continuous vector space. These embeddings capture the semantic meaning and relationships between words based on the context in which they appear in the text. As mentioned previously, Word2Vec has two main architectures for learning word embeddings: CBOW and skip-gram. Both learn word embeddings by predicting words based on their surrounding context:
In the CBOW model, the objective is to maximize the average log probability of observing the target word given the context words:

(1/T) * Σ_{t=1..T} log P(target_t | context_t)
Here, T is the total number of words in the text, and P(target | context) is the probability of observing the target word given the context words, which is calculated using the softmax function:
P(target | context) = exp(v′_target · v̄_context) / Σ_{w ∈ V} exp(v′_w · v̄_context)

Here, v′_target is the output vector (word embedding) of the target word, v̄_context is the average input vector (context word embedding) of the context words, and the sum in the denominator runs over all words w in the vocabulary V.
In the skip-gram model, the objective is to maximize the average log probability of observing the context words given the target word:

(1/T) * Σ_{t=1..T} Σ_{c ∈ context(t)} log P(c | target_t)
Here, T is the total number of words in the text, and P(context | target) is the probability of observing the context words given the target word, which is calculated using the softmax function:
P(context | target) = exp(v′_context · v_target) / Σ_{w ∈ V} exp(v′_w · v_target)

Here, v′_context is the output vector (context word embedding) of the context word, v_target is the input vector (word embedding) of the target word, and the sum in the denominator runs over all words w in the vocabulary V.
The training process for both CBOW and skip-gram involves iterating through the text and updating the input and output weight matrices using stochastic gradient descent (SGD) and backpropagation to minimize the difference between the predicted words and the actual words. The learned input weight matrix contains the word embeddings for each word in the vocabulary.
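As a toy illustration of this training loop (not production Word2Vec, which uses tricks such as negative sampling), the skip-gram softmax objective and its SGD updates can be written in plain NumPy:

```python
import numpy as np

corpus = "the cat sat on the mat the dog sat on the rug".split()
vocab = sorted(set(corpus))
idx = {w: i for i, w in enumerate(vocab)}
V, D, window, lr = len(vocab), 8, 2, 0.05

rng = np.random.default_rng(0)
W_in = rng.normal(scale=0.1, size=(V, D))   # input vectors (word embeddings)
W_out = rng.normal(scale=0.1, size=(V, D))  # output (context) vectors

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# (target, context) training pairs within the window
pairs = [(idx[corpus[t]], idx[corpus[c]])
         for t in range(len(corpus))
         for c in range(max(0, t - window), min(len(corpus), t + window + 1))
         if c != t]

losses = []
for epoch in range(200):
    total = 0.0
    for t, c in pairs:
        v = W_in[t]
        p = softmax(W_out @ v)            # P(context | target)
        total += -np.log(p[c])
        grad = p.copy()
        grad[c] -= 1.0                    # gradient of -log p[c] w.r.t. scores
        dv = W_out.T @ grad               # backprop to the input embedding
        W_out -= lr * np.outer(grad, v)   # SGD update, output matrix
        W_in[t] -= lr * dv                # SGD update, input embedding
    losses.append(total / len(pairs))
```

After training, the rows of `W_in` are the learned word embeddings, and the average per-pair loss in `losses` should decrease over the epochs.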
Text classification using Word2Vec involves creating word embeddings using the Word2Vec algorithm and then training an ML model to classify text based on these embeddings. The following steps outline the process in detail, including the mathematical aspects:
Here, T is the total number of words in the text, and P(context | target) is the probability of observing the context words given the target word, which is calculated using the softmax function:
P(context | target) = exp(v′_context · v_target) / Σ_{w ∈ V} exp(v′_w · v_target)

Here, v′_context is the output vector (context word embedding) of the context word, v_target is the input vector (word embedding) of the target word, and the sum in the denominator runs over all words w in the vocabulary V.
3. Create document embeddings: For each document in the dataset, calculate the document embedding by averaging the word embeddings of the words in the document:

d = (1/N) * Σ_{i=1..N} v_{w_i}
Here, N is the number of words in the document, and the sum runs over all words in the document. Please note that, in our experience, this approach to text classification with Word2Vec is only useful when documents are short. If your documents are longer, or contain words with opposing meanings, this approach won't perform well. An alternative solution is to combine Word2Vec with a CNN: use Word2Vec to obtain the word embeddings, then feed those embeddings as input to the CNN.
4. Model training: Use the document embeddings as features to train an ML model, such as logistic regression, SVM, or a neural network, for text classification. The model learns to predict the class label based on the document embeddings.
5. Model evaluation: Evaluate the performance of the model using appropriate evaluation metrics, such as accuracy, precision, recall, F1 score, or confusion matrix, and use techniques such as cross-validation to get a reliable estimate of the model’s performance on unseen data.
6. Model application: Apply the trained model to new, unseen text data. Preprocess and compute the document embeddings for the new text data using the same Word2Vec model and vocabulary, and use the model to predict the class labels.
In summary, text classification using Word2Vec involves creating word embeddings with the Word2Vec algorithm, averaging these embeddings to create document embeddings, and training an ML model to classify text based on these document embeddings. The Word2Vec algorithm learns word embeddings by maximizing the average log probability of observing context words given a target word, capturing the semantic relationships between words in the process.
Evaluating the performance of text classification models is crucial to ensure that they meet the desired level of accuracy and generalizability. Several metrics and techniques are commonly used to evaluate text classification models, including accuracy, precision, recall, F1 score, and confusion matrix. Let’s discuss each of these in more detail:
While accuracy is easy to understand, it may not be the best metric for imbalanced datasets, where the majority class can dominate the metric’s value.
When dealing with multi-class classification, we have F1 micro and F1 macro. F1 micro and F1 macro are two ways to compute the F1 score for multi-class or multi-label classification problems. They aggregate precision and recall differently, leading to different interpretations of the classifier’s performance. Let’s discuss each in more detail:
F1_macro = (1/n) * Σ_{i=1..n} F1_i

Here, n is the number of classes, and F1_i is the F1 score for the i-th class.
F1 macro is particularly useful when you want to evaluate the performance of a classifier across all classes without giving more weight to the majority class. However, it may not be suitable when the class distribution is highly imbalanced as it can provide an overly optimistic estimate of the model’s performance.
Here, global precision and global recall are calculated as follows, pooling the true positive (TP), false positive (FP), and false negative (FN) counts over all classes i:

Global precision = Σ_i TP_i / Σ_i (TP_i + FP_i)
Global recall = Σ_i TP_i / Σ_i (TP_i + FN_i)

F1 micro is then the harmonic mean of the two: F1_micro = 2 * (global precision * global recall) / (global precision + global recall).
F1 micro is useful when you want to evaluate the overall performance of a classifier considering the class distribution, especially when dealing with imbalanced datasets.
In summary, F1 macro and F1 micro are two ways to compute the F1 score for multi-class or multi-label classification problems. F1 macro treats each class as equally important, regardless of the class distribution, while F1 micro takes class imbalance into account by considering the number of instances in each class. The choice between F1 macro and F1 micro depends on the specific problem and whether class imbalance is an important factor to consider.
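A hedged example of how these metrics diverge on an invented, imbalanced three-class problem, using scikit-learn (for single-label multi-class data, micro-F1 equals accuracy):

```python
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score

# Class 0 dominates; the classifier over-predicts it
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 0, 0, 0, 0, 1, 0, 0]

acc = accuracy_score(y_true, y_pred)
f1_macro = f1_score(y_true, y_pred, average="macro")
f1_micro = f1_score(y_true, y_pred, average="micro")
cm = confusion_matrix(y_true, y_pred)

print(acc, f1_micro, f1_macro)  # macro is pulled down by the minority classes
print(cm)
```

Here micro-F1 matches accuracy (7/9), while macro-F1 is much lower because the never-predicted class 2 contributes an F1 of 0 with equal weight.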
A confusion matrix serves as a tabular representation, showcasing the count of true positive, true negative, false positive, and false negative predictions made by a classification model. This matrix offers a nuanced perspective on the model’s efficacy, enabling a thorough comprehension of both its strengths and weaknesses.
For a binary classification problem, the confusion matrix is arranged as follows:
| Actual/Predicted  | (Predicted) Positive | (Predicted) Negative |
| (Actual) Positive | True Positive        | False Negative       |
| (Actual) Negative | False Positive       | True Negative        |
Table 5.2 – Confusion matrix – general view
For multi-class classification problems, the confusion matrix is extended to include the true and predicted counts for each class. The diagonal elements represent the correctly classified instances, while the off-diagonal elements represent misclassifications.
In summary, evaluating text classification models involves using various metrics and techniques, such as accuracy, precision, recall, F1 score, and the confusion matrix. Selecting the appropriate evaluation metrics depends on the specific problem, dataset characteristics, and the trade-offs between false positives and false negatives. Evaluating a model using multiple metrics can provide a more comprehensive understanding of its performance and help guide further improvements.
Overfitting and underfitting are two common issues that arise during the training of ML models, including text classification models. They both relate to how well a model generalizes to new, unseen data. This section will explain overfitting and underfitting, when they happen, and how to prevent them.
Overfitting arises when a model excessively tailors itself to the intricacies of the training data. In this case, the model captures noise and random fluctuations rather than discerning the fundamental patterns. Consequently, although the model may exhibit high performance on the training data, its effectiveness diminishes when applied to unseen data, such as a validation or test set.
To avoid overfitting in text classification, consider the following strategies:
Next, we’ll cover underfitting.
Underfitting happens when a model is too simple and fails to capture the underlying patterns in the data. Consequently, the model performance is low on both training and test data. The model is too simple to represent the complexity of the data and can’t generalize well.
To avoid underfitting in text classification, consider the following strategies:
In summary, overfitting and underfitting are two common issues in text classification that affect a model’s ability to generalize to new data. Avoiding these issues involves balancing model complexity, using appropriate features, tuning hyperparameters, employing regularization, and monitoring model performance on a validation set. By addressing overfitting and underfitting, you can improve the performance and generalizability of your text classification models.
An important step in building an effective classification model is hyperparameter tuning. Hyperparameters are the model parameters that are defined before training; they will not change during training. These parameters determine the model architecture and behavior. Some of the hyperparameters that can be used are the learning rate and the number of iterations. They can significantly impact the model’s performance and generalizability.
The process of hyperparameter tuning in text classification involves the following steps:
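This process can be sketched with scikit-learn's GridSearchCV; the pipeline, parameter grid, and toy data below are illustrative:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

texts = ["stocks rallied today", "earnings beat estimates", "sell-off hits tech",
         "shares plunge sharply", "record quarterly profit", "markets close higher",
         "losses deepen for bank", "stock crashes on news"]
labels = [1, 1, 0, 0, 1, 1, 0, 0]

pipe = Pipeline([("tfidf", TfidfVectorizer()),
                 ("clf", LogisticRegression(max_iter=1000))])

# Candidate hyperparameter values (illustrative); the search evaluates every
# combination with cross-validation and keeps the best-scoring one
grid = {"tfidf__ngram_range": [(1, 1), (1, 2)], "clf__C": [0.1, 1.0, 10.0]}
search = GridSearchCV(pipe, grid, cv=2, scoring="f1")
search.fit(texts, labels)
print(search.best_params_)
```

On a real dataset, the grid would be larger and the chosen evaluation metric would reflect the business objective.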
Hyperparameter tuning affects the performance of the model by finding the optimal combination of parameters that results in the best model performance on the chosen evaluation metric. Tuning hyperparameters can help address issues such as overfitting and underfitting, balance model complexity, and improve the model’s ability to generalize to new data.
Hyperparameter tuning is a crucial process in text classification that involves searching for the optimal combination of model parameters to maximize performance on a chosen evaluation metric. By carefully tuning hyperparameters, you can improve the performance and generalizability of your text classification models.
In the real world, applying text classification involves various practical considerations and challenges that arise from the nature of real-world data and problem requirements. Some common issues include dealing with imbalanced datasets, handling noisy data, and choosing appropriate evaluation metrics.
Let’s discuss each of these in more detail.
Text classification tasks often encounter imbalanced datasets, wherein certain classes boast a notably higher number of instances compared to others. This imbalance can result in models that are skewed, excelling in predicting the majority class while faltering in accurately classifying the minority class. To handle imbalanced datasets, consider the following strategies:
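Two common strategies, random oversampling of the minority class and inverse-frequency class weights, can be sketched in plain Python; the toy data is illustrative:

```python
from collections import Counter
import random

# Toy imbalanced dataset: 90% class 0, 10% class 1 (illustrative)
data = [("text%d" % i, 0) for i in range(90)] + [("rare%d" % i, 1) for i in range(10)]

# Strategy 1: random oversampling duplicates minority examples until
# the classes are balanced
random.seed(0)
minority = [d for d in data if d[1] == 1]
oversampled = data + random.choices(minority, k=80)
print(Counter(label for _, label in oversampled))

# Strategy 2: class weights, inversely proportional to class frequency,
# so that misclassifying a minority example costs more during training
counts = Counter(label for _, label in data)
n = len(data)
weights = {c: n / (len(counts) * cnt) for c, cnt in counts.items()}
print(weights)
```

Many scikit-learn classifiers accept such weights directly via a `class_weight` parameter, which avoids duplicating data.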
To handle noisy data, consider the following strategies:
Whether we’re working on imbalanced data or not, we always need to evaluate our model, and choosing the right metric to evaluate our model is important. Next, we’ll explain how to select the best metric to evaluate our model.
Selecting the right evaluation metrics is crucial for measuring the performance of your text classification model and guiding model improvements.
Consider the following when choosing evaluation metrics:
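As an illustration of why metric choice matters, the following sketch (with made-up predictions) shows how accuracy can look strong on an imbalanced set while recall exposes the weakness:

```python
# With 95% negatives, accuracy can look great while recall on the
# positive class is poor -- one reason metric choice matters
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 95 + [1, 0, 0, 0, 0]  # catches only 1 of 5 positives

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

accuracy = sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(accuracy)  # 0.96 -- looks strong
print(recall)    # 0.2  -- reveals the weakness
```

Here accuracy is 0.96 even though the classifier misses 4 of the 5 positive examples; the F1 score, which balances precision and recall, is a more honest summary in such cases.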
In summary, practical considerations in text classification include dealing with imbalanced datasets, handling noisy data, and choosing appropriate evaluation metrics. Addressing these issues can help improve the performance and generalizability of your text classification models and ensure that they meet the specific requirements of your problem.
Topic modeling is an unsupervised ML technique that’s used to discover abstract topics or themes within a large collection of documents. It assumes that each document can be represented as a mixture of topics, and each topic is represented as a distribution over words. The goal of topic modeling is to find the underlying topics and their word distributions, as well as the topic proportions for each document.
There are several topic modeling algorithms, but one of the most popular and widely used is LDA. We will discuss LDA in detail, including its mathematical formulation.
LDA is a generative probabilistic model that assumes the following generative process for each document:
The generative process is a theoretical model used by LDA to reverse-engineer the original documents from presumed topics.
LDA aims to find the topic-word distributions (φ) and document-topic distributions (θ) that best explain the observed documents.
Mathematically, LDA can be described using the following notation:
M is the number of documents and N_d is the number of words in document d; K is the number of topics; θ_d is the topic distribution for document d, drawn from a Dirichlet prior with parameter α; φ_k is the word distribution for topic k, drawn from a Dirichlet prior with parameter β; z_{d,n} is the topic assignment for the n-th word in document d, and w_{d,n} is the observed word itself.
The joint probability of the topic assignments (z) and words (w) in the documents, given the topic-word distributions (φ) and document-topic distributions (θ), can be written as follows:
P(z, w | θ, φ) = ∏_{d=1}^{M} ∏_{n=1}^{N_d} P(z_{d,n} | θ_d) P(w_{d,n} | φ_{z_{d,n}})
The objective of LDA is to maximize the likelihood of the observed words given the Dirichlet priors α and β:
P(w | α, β) = ∏_{d=1}^{M} ∫ P(θ_d | α) ∏_{n=1}^{N_d} Σ_{z_{d,n}} P(z_{d,n} | θ_d) P(w_{d,n} | z_{d,n}, β) dθ_d
However, computing the likelihood directly is intractable due to the integration over the latent variables θ and φ. Therefore, LDA uses approximate inference algorithms, such as Gibbs sampling or variational inference, to estimate the posterior distributions P(θ | w, α, β) and P(φ | w, α, β).
Once the posterior distributions have been estimated, we can obtain the document-topic distributions (θ) and topic-word distributions (φ), which can be used to analyze the discovered topics and their word distributions, as well as the topic proportions for each document.
Let’s consider a simple example of topic modeling.
Suppose we have a collection of three documents:
We want to discover two topics (K = 2) in this document collection. Here are the steps that we need to perform:
For our example, LDA might discover the following topics:
With these topics, the document-topic distribution (θ) might look like this:
In this example, topic 1 seems to be related to football and sports, while topic 2 seems to be related to technology and gadgets. The topic distributions for each document show that documents 1 and 2 are mostly about football, while document 3 is about technology.
Please note that this is a simplified example, and real-world data would require more sophisticated preprocessing and a larger number of iterations for convergence.
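The example above can be sketched with scikit-learn's LDA implementation; the documents mirror the football/technology scenario, and which topic index corresponds to which theme depends on the random initialization:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Three toy documents echoing the football/technology example
docs = [
    "football match team goal player",
    "football player scores goal team",
    "smartphone gadget technology screen battery",
]

# Bag-of-words counts, then LDA with K = 2 topics
counts = CountVectorizer().fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
theta = lda.fit_transform(counts)  # document-topic proportions, rows sum to 1

# The topic-word distributions (phi, unnormalized) live in lda.components_
print(theta.round(2))
```

Inspecting the largest entries of each row of `lda.components_` recovers the most probable words per topic, which is how the football/technology interpretation would be made.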
We are now ready to discuss the paradigm for putting together a complete project in a work or research setting.
This section is dedicated to the practical implementation of the various methods we discussed. It will revolve around Python code, which serves as a complete pipeline.
To provide a comprehensive learning experience, we will discuss the entire journey of a typical ML project. Figure 5.1 depicts the different phases of the ML project:
Figure 5.1 – The paradigm of a typical ML project
Let’s break the problem down in a similar fashion to a typical project in the industry.
An ML project, whether in a business or research setting, stems from an original objective, which is often qualitative rather than technical.
Here’s an example:
Next comes the technical objective.
The original objective needs to be translated into a technical objective, like so:
While the original business or research objective is somewhat of an open-ended question, the technical objective reflects an actionable plan. Note, however, that any given technical objective represents just one among several potential solutions aligned with the original business or research aim. It is the responsibility of the technical authority, such as the CTO, ML manager, or senior developer, to understand the original objective and translate it into a technical objective. Moreover, the technical objective may be refined or even replaced down the line. The next step after forming a technical objective is to form a plan for it.
To realize the technical objective, we need to derive a plan to decide which data would be used to feed into the ML system, and what the expected output of the ML system is. In the first steps of a project, there may be several candidate sources of potential data that are believed to be indicative of the desired output.
Following the set of three examples mentioned previously, here are some examples of data descriptions:
When defining a potential solution approach, extra attention should be dedicated to identifying the best metric to focus on, also known as the objective function or error function. This is the metric by which the success of the solution will be evaluated. It is important to relate the metric to the original business or research objective.
As per the previous examples, we could have the following:
Now that we have a tentative plan, we can explore the data and evaluate the feasibility of the design.
Exploration is divided into two parts – exploring the data and exploring the feasibility of the design. Let’s take a closer look.
Data is not always perfect for our objective. We discussed some of the data shortcomings in previous chapters. In particular, free text is often notorious for having many abnormal phenomena, such as encodings, special characters, typos, and so on. When exploring our data, we want to uncover all these phenomena and make sure that the data can be brought to a form that serves the objective.
Here, we want to prospectively identify proxies for whether the planned design is expected to succeed. While with some problems there are known proxies for expected success, in most problems in the business and especially research setting, it takes much experience and ingenuity to suggest preliminary proxies for success.
An example of a very simple case is a regression problem with a single input variable and a single output variable. Let’s say the independent variable is the number of active viewers that your streaming service currently has, and the dependent variable is the risk of the company’s servers maxing out their capacity. The tentative design plan would be to build a regressor that estimates the risk at any given moment. A strong proxy for the feasibility of developing a successful regressor could be the linear correlation between the historical data points. Calculating linear correlation on sample data is easy and quick, and if the result is close to 1 (or -1, in problems unlike our business case), a linear regressor is very likely to succeed, making it a great proxy. Note, however, that if the linear correlation is close to 0, it doesn’t necessarily mean that a regressor would fail, only that a linear regression would fail. In such a case, we should defer to a different proxy.
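This proxy can be sketched in a few lines; the viewer/risk numbers below are made up for illustration:

```python
import numpy as np

# Illustrative samples: active viewers vs. server-capacity risk
viewers = np.array([100, 250, 400, 520, 700, 910])
risk = np.array([0.05, 0.12, 0.22, 0.30, 0.41, 0.55])

# Pearson correlation as a quick feasibility proxy for a linear regressor
r = np.corrcoef(viewers, risk)[0, 1]
print(round(r, 3))  # close to 1, so a linear model looks promising here
```

A value near 1 or -1 supports proceeding with a linear regressor; a value near 0 only rules out the linear model, not the relationship itself.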
In the Reviewing our use case – ML system design for NLP classification in a Jupyter Notebook section, we’ll review our code solution. We’ll also present a method to assess the feasibility of a text classifier. The method aims to capture the relationship between the input text and the output class. But since the method must suit a variable that is text rather than numeric, we’ll go back to first principles and calculate a measure of the statistical dependency between the input text and the output class. Statistical dependency is the most basic measure of a relationship between variables and thus doesn’t require either of them to be numeric.
Assuming the feasibility study is successful, we can move on to implementing the ML solution.
This part is where the expertise of the ML developer comes into play. There are different steps for it and the developer chooses which ones are relevant based on the problem – whether it’s data cleaning, text segmentation, feature design, model comparison, or metric choice.
We will elaborate on this as we review the specific use case we’ve solved.
We evaluate the solution using the metric that was chosen. This part requires some experience, as ML developers tend to get better at it over time. The main pitfall in this task is failing to set up an objective assessment of the result. That objective assessment is done by applying the finished model to data it has never “seen” before. However, those who are only starting to apply ML often find themselves improving their design after seeing the results on that held-out set. This leads to a feedback loop in which the design is effectively fitted to the no-longer-held-out set. While this may indeed improve the model and the design, it takes away the ability to provide an objective forecast of how the model would perform when implemented in the real world, where it would see data that is truly held out and that it wasn’t fitted to.
Typically, when the design is done, the implementation is complete, and the results have been found satisfactory, the work is presented for business implementation, or in the research setting, for publication. In the business setting, implementation can take on different forms.
One of the simplest forms is where the output is used to provide business insights. Its purpose is to be presented. For instance, when looking to evaluate how much a marketing campaign was contributing to the growth in sales, the ML team may calculate an estimation for that measure of contribution and present it to leadership.
Another form of implementation is within a dashboard in real time. For instance, the model calculates the predicted risk of patients coming to the emergency room, and it does so on a daily cadence. The results are aggregated and a graph is presented on the hospital dashboard to show the expected number of people who would come to the emergency room for every day of the next 30 days.
A more advanced and common form is when the output of the data is directed so that it can be fed into downstream tasks. The model would then be implemented in production to become a microservice within a larger production pipeline. An example of that is when a classifier evaluates every post on your company’s Facebook page. When it identifies offensive language, it outputs a detection that then passes down the pipeline to another system that removes that post and perhaps blocks that user.
The code’s design should suit the purpose of the code once the work is complete. As per the different forms of implementation mentioned previously, some implementations dictate a specific code structure. For instance, when the completed code is handed off to production within a larger, already existing pipeline, it is the production engineer who dictates the constraints to the ML team. These constraints may concern computation and timing resources, but also code design. Often, basic code files, such as .py files, are necessary.
In cases where the code is used for presentation, such as the example of showing how much the marketing campaign contributed, Jupyter Notebooks may be the better choice.
Jupyter Notebooks can be very informative and instructional. For that reason, many ML developers start their projects with Jupyter Notebooks for the exploration phase.
Next, we will review our design in a Jupyter Notebook. This will allow us to encapsulate the entire process in a single coherent file that is meant to be presented to the reader.
In this section, we will walk through a hands-on example. We will follow the steps we presented previously for articulating the problem, designing the solution, and evaluating the results. This section portrays the process that an ML developer goes through when working on a typical project in the industry. Refer to the notebook at https://colab.research.google.com/drive/1ZG4xN665le7X_HPcs52XSFbcd1OVaI9R?usp=sharing for more information.
In this scenario, we are working for a financial news agency. Our objective is to publish news about companies and products in real time.
The CTO derives several technical objectives from the business objective. One objective is for the ML team: given a stream of financial tweets in real time, detect those tweets that discuss information about companies or products.
Let’s review the different parts of the pipeline, as shown in Figure 5.2:
Figure 5.2 – The structure of a typical ML pipeline
Note
The phases of the pipeline in Figure 5.2 are explored in the following subsections.
In this part of the code, we set the key parameters. We choose to have them as a part of the code as this is instructional code made for presentation. In cases where the code is expected to go to production, it may be better to host the parameters in a separate .yaml file. That would also suit heavy iterations during the development phase as it will allow you to iterate over different code parameters without having to change the code, which is often desirable.
As for the choice of these values, it should be stressed that some of them should themselves be optimized as part of optimizing the solution. We have chosen fixed values here to simplify the process. For instance, the number of features to be used for classification is a fixed quantity here, but it should also be optimized to fit the training set.
This part loads the dataset. In our case, the loading function is simple. In other business cases, this part could be quite large as it may include a collection of SQL queries that are called. In such a case, it may be ideal to write a dedicated function in a separate .py file and source it via the imports section.
Here, we format the data in a way that suits our work. We also observe some of it for the first time. This allows us to get a feel of its nature and quality.
One key action we take here is to define the classes we care about.
As we discussed in Chapter 4, preprocessing is a key part of the pipeline. For instance, we notice that many of the tweets have a URL, which we choose to remove.
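As a sketch of this preprocessing step, URLs can be stripped with a regular expression; the pattern below is one common choice, not necessarily the one used in the notebook:

```python
import re

def clean_tweet(text):
    """Remove URLs (a preprocessing choice noted above) and tidy whitespace."""
    text = re.sub(r"https?://\S+|www\.\S+", "", text)  # drop http(s) and www links
    return re.sub(r"\s+", " ", text).strip()           # collapse leftover spaces

print(clean_tweet("ACME beats estimates! https://t.co/abc123 Read more"))
```

Similar substitutions handle other tweet-specific noise, such as user mentions or cashtags, depending on whether they carry signal for the task.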
At this point, we have observed the quality of the text and the distribution of the classes. This is where we explore any other characteristics of the data that may imply either its quality or its ability to indicate the desired class.
Next, we start processing the text. We seek to represent the text of each observation as a set of numerical features. The main reason for this is that traditional ML models are designed to accept numbers as input, not text. For instance, a common linear regression or logistic regression model is applied to numbers, not words, categories, or image pixels. Thus, we need to suggest a numeric representation for the text. This design constraint is lifted when working with language models such as BERT and GPT. We will see this in the coming chapters.
We partition the text into N-grams, where N is a parameter of the code. N is fixed in this code but should be optimized to best fit the training set.
Once the text has been partitioned into N-grams, they are modeled as numeric values. When a binary (that is, one-hot encoding) method is chosen, the numerical feature that represents some N-gram gets a “1” when the observed text includes that N-gram, and “0” otherwise. See Figure 5.3 for an example. If a BOW approach is chosen, then the value of the feature is the number of times the N-gram appears in the observed text. Another common feature engineering method that isn’t implemented here is TF-IDF.
Here’s what we get by using unigrams only:
Input sentence: “filing submitted.”
| N-gram | Feature value |
| “report” | 0 |
| “filing” | 1 |
| “submitted” | 1 |
| “product” | 0 |
| “quarterly” | 0 |
| The rest of the unigrams | (0’s) |
Figure 5.3 – Transforming an input text sentence into a numerical representation by partitioning to unigrams via one-hot encoding
The following figure shows what we get by using both unigrams and bigrams:
| N-gram | Feature value |
| “report” | 0 |
| “filing” | 1 |
| “filing submitted” | 1 |
| “report news” | 0 |
| “submitted” | 1 |
| The rest of the N-grams | (0’s) |
Figure 5.4 – Transforming an input text sentence into a numerical representation by partitioning to unigrams and bigrams via one-hot encoding
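The transformations in the figures above can be reproduced with scikit-learn's CountVectorizer; the fixed vocabulary below is a stand-in for one that would normally be learned from the training corpus:

```python
from sklearn.feature_extraction.text import CountVectorizer

sentence = ["filing submitted"]

# One-hot unigrams (as in Figure 5.3): binary=True caps each count at 1
unigrams = CountVectorizer(ngram_range=(1, 1), binary=True,
                           vocabulary=["report", "filing", "submitted",
                                       "product", "quarterly"])
print(unigrams.fit_transform(sentence).toarray())  # [[0 1 1 0 0]]

# Unigrams and bigrams together (as in Figure 5.4)
mixed = CountVectorizer(ngram_range=(1, 2), binary=True,
                        vocabulary=["report", "filing", "filing submitted",
                                    "report news", "submitted"])
print(mixed.fit_transform(sentence).toarray())  # [[0 1 1 0 1]]
```

Dropping `binary=True` switches the representation from one-hot to BOW counts, and `TfidfVectorizer` would implement the TF-IDF alternative mentioned above.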
Note that at this point in the code, the dataset hasn’t been partitioned into train and test sets, and the held-out set has not been excluded yet. This is because the binary and BOW feature engineering methods don’t depend on data outside of the underlying observation. With TF-IDF, this is different. Every feature value is calculated using the entire dataset for the document frequency.
Now that our text has been represented as a feature, we can explore it numerically. We can look at its frequencies and statistics and get a sense of how it’s distributed.
This is the part where we must pause and carve out a held-out set, also known as a test set, and sometimes as the validation set. Since these terms are used differently in different sources, it is important to explain that what we refer to as a test set is a held-out set. A held-out set is a data subset that we dedicate to evaluating our solution’s performance. It is held out to simulate the results that we expect to get when the system is implemented in the real world and will encounter new data samples.
How do we know when to carve out the held-out set?
If we carve it out “too early,” such as right after loading the data, then we are guaranteed to keep it held out, but we may miss discrepancies in the data as it won’t take part in the preliminary exploration. If we carve it out “too late,” our design decisions might become biased because of it. For example, if we choose one ML model over another based on results that include the would-be held-out set, then our design becomes tailored to that set, preventing us from offering an objective evaluation of the model.
Then, we need to carve out the test set right before the first action that will feed into design decisions. In the next section, we’ll perform statistical analysis, which we can then feed into feature selection. Since that selection should be agnostic to the held-out set, we’ll exclude that set from this part onwards.
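A common way to carve out the held-out set at this point is scikit-learn's train_test_split; the data below is illustrative:

```python
from sklearn.model_selection import train_test_split

X = [[i] for i in range(100)]   # feature rows (illustrative)
y = [0] * 50 + [1] * 50         # class labels

# Carve out 20% as a held-out set; stratify preserves the class balance
# in both splits, and the fixed seed makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# All exploration and design decisions from here on use only the train split
print(len(X_train), len(X_test))  # 80 20
```

Everything downstream, including the feasibility analysis below, should touch only `X_train`/`y_train`; `X_test`/`y_test` stay untouched until the final evaluation.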
This is the second part of the exploration phase we spoke about a few pages ago. The first part was data exploration, and we implemented that in the previous parts of the code. Now that we have the text represented as numerical features, we can perform the feasibility study.
We seek to measure the statistical dependence between the text inputs and the class values. Again, the motivation is to mimic the proxy that linear correlation provides with a regression problem.
We know that for two random variables, X and Y, if they are statistically independent, then we get the following:
P(X = x, Y = y) = P(X = x) P(Y = y)
Alternatively, we get the following:
P(X = x | Y = y) = P(X = x)
This happens for every x, y value that yields a non-zero probability.
Conversely, we could use Bayes’ rule:
P(X = x, Y = y) / (P(X = x) P(Y = y)) = P(X = x | Y = y) / P(X = x)
Now, let’s think about any two random variables that aren’t necessarily statistically independent. We would like to evaluate whether there is a statistical relationship between the two.
Let one random variable be any of our numerical features, and the other random variable be the output class taking on values 0 or 1. Let’s assume the feature engineering method is binary, so the feature also takes on values of 0 or 1.
Looking at the last equation, the expression on the left-hand side presents a very powerful measure of the relationship between X and Y:
P(X = x, Y = y) / (P(X = x) P(Y = y))
It is powerful because if the feature is completely nonindicative of the class value, then in statistical terms, we say the two are statistically independent, and thus this measure would be equal to 1.
Conversely, the bigger the difference between this measure and 1, the stronger the relationship is between this feature and this class. When performing a feasibility study of our design, we want to see that there are features in the data that have a statistical relationship with the output class.
For that reason, we calculate the value of this expression for every pair of every feature and every class.
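A minimal sketch of this dependency measure, computed for each binary feature against the class; the toy matrix is illustrative, and a value of 1 marks a feature that is independent of the class:

```python
import numpy as np

# Binary one-hot features (rows = documents, columns = N-grams) and labels
X = np.array([[1, 0, 1],
              [1, 0, 0],
              [0, 1, 1],
              [0, 1, 0]])
y = np.array([1, 1, 0, 0])

def dependence(feature, labels, x_val=1, y_val=1):
    """P(X=x, Y=y) / (P(X=x) * P(Y=y)); equals 1 under independence."""
    p_x = np.mean(feature == x_val)
    p_y = np.mean(labels == y_val)
    p_xy = np.mean((feature == x_val) & (labels == y_val))
    return p_xy / (p_x * p_y)

scores = [dependence(X[:, j], y) for j in range(X.shape[1])]
print(scores)  # feature 0 is indicative of class 1, feature 2 is independent
```

In the toy data, feature 0 co-occurs only with class 1 (score 2.0), feature 1 never does (score 0.0), and feature 2 is spread evenly (score 1.0); sorting features by distance from 1 surfaces the most indicative terms.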
We present the most indicative terms for class “0” (tweets that don’t convey company or product information), as well as the terms most indicative of class “1” (tweets that discuss information about a company or a product).
This proves to us that there are indeed text terms that are indicative of the class value. This is a definite and clear success of the feasibility study. We are good to go, and we expect productive outcomes when implementing a classification model.
As a side note, keep in mind that, as with most evaluations, what we’ve just mentioned is just one sufficient condition for the potential of the text to predict the class. If it had failed, it would not necessarily indicate that there is no feasibility. Just like when the linear correlation between X and Y is near 0, this doesn’t mean that X can’t infer Y; it just means that X cannot infer Y via a linear model. Linearity is an assumption that’s made to keep things simple when it holds.
In the method that we’ve suggested, we make two key assumptions. First, we assume a very particular manner of feature design: a certain N for the N-gram partition, and a certain quantitative method for the value (binary). Second, we perform the simplest evaluation of statistical dependency, a univariate one. But it could be that only a higher-order, multivariate dependency would reveal a statistical relationship with the outcome class.
通过文本分类的可行性研究,如果方法尽可能简单,同时覆盖尽可能多的希望发现的“信号”,那就是理想的。我们在本示例中设计的方法是根据多年不同集合和各种问题设置的经验得出的。我们发现它很好地达到了目标。
With a feasibility study of text classification, it’s ideal if the method is as simple as possible while covering as much of the “signal” it is hoping to uncover. The approach we designed in this example was derived after years of experience with different sets and various problem settings. We find that it hits the target very well.
在可行性研究中,我们经常一石二鸟。可行性研究成功后,它不仅可以帮助我们确认我们的计划,而且常常暗示下一步的计划我们应该采取的步骤。正如我们所看到的,一些特征代表了该类,并且我们了解到哪些特征是最重要的。这使我们能够减少分类模型需要划分的特征空间。我们通过为两个类别中的每一个保留最具指示性的特征来做到这一点。理想情况下,我们选择保留的特征数量是由计算约束(例如,太多的特征将花费太长时间来计算模型)、模型能力(例如,太多的特征不能很好地处理)导出的。模型由于共线性),以及训练结果的优化。在我们的代码中,我们修复了这个数字以使事情变得快速而简单。
With the feasibility study, we often kill two birds with one stone. As a feasibility study is successful, it not only helps us by confirming our plan, but it often hints toward the next steps that we should take. As we saw, some features are indicative of the class, and we learned which are the most significant. This allows us to reduce the feature space that the classification model will need to partition. We do that by keeping the most indicative features for each of the two classes. The number of features that we choose to keep would ideally be derived by computation constraints (for example, too many features would take too long to compute a model around), model capabilities (for example, too many features can’t be handled well by the model due to co-linearity), and optimization of the train results. In our code, we fixed this number to make things quick and simple.
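As a rough sketch of this idea (not the book’s actual code — the toy documents, terms, and the fixed k below are illustrative assumptions), we can rank binary term features by the gap between their occurrence rates in the two classes on the train set and keep only the top k:

```python
# Hedged sketch: keep the k features most indicative of class membership,
# ranked by the gap between class-conditional occurrence rates (train set only).
def rate(rows, feature):
    """Fraction of documents in `rows` that contain `feature`."""
    return sum(1 for r in rows if feature in r["terms"]) / len(rows)

train = [
    {"terms": {"refund", "order"},  "label": "complaint"},
    {"terms": {"refund", "broken"}, "label": "complaint"},
    {"terms": {"thanks", "great"},  "label": "praise"},
    {"terms": {"great", "fast"},    "label": "praise"},
]
pos = [r for r in train if r["label"] == "complaint"]
neg = [r for r in train if r["label"] == "praise"]
vocab = set().union(*(r["terms"] for r in train))

# Indicativeness: absolute gap between the two class-conditional rates.
scored = sorted(vocab, key=lambda f: abs(rate(pos, f) - rate(neg, f)), reverse=True)
k = 3  # fixed to keep things quick and simple, as in the chapter's code
selected = scored[:k]
print(selected)
```

The same `selected` list would then be applied unchanged to the test set.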
It should be stressed that in many ML models, feature selection is an inherited part of the model design. For instance, with the least absolute shrinkage and selection operator (LASSO), the hyperparameter scaler of the norm component has an impact on which features get a zero coefficient, and thus get “thrown out.” It is possible and sometimes recommended to skip this part of the feature selection process, leave all features in, and let the model perform feature selection. It is advised to do so when all the models that are being evaluated and compared possess that characteristic.
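A minimal sketch of this behavior, assuming scikit-learn is available (the synthetic data and the regularization strength `C` are illustrative choices): an L1-penalized logistic regression drives the coefficients of uninformative features to exactly zero, performing feature selection inside the model:

```python
# Hedged sketch: L1 (LASSO-style) regularization as built-in feature selection.
# Smaller C = stronger shrinkage = more coefficients driven exactly to zero.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))
# Only the first two features carry signal; the other 18 are pure noise.
y = (X[:, 0] + X[:, 1] > 0).astype(int)

clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
clf.fit(X, y)
n_zero = int(np.sum(clf.coef_ == 0.0))
print(n_zero, "of", X.shape[1], "coefficients were shrunk to exactly zero")
```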
Remember that at this point, we are only observing the train set. Now that we have decided which features to keep, we need to apply that selection to the test set as well.
With that, our data has been prepared for ML modeling.
To choose which model suits this problem best, we must train several models and see which one of them does best.
We should stress that we could do many things to try and identify the best model choice for a given problem. In our case, we only chose to evaluate a handful of models. Moreover, to make things simple and quick, we chose to not optimize the hyperparameters of each model in a comprehensive cross-validation approach. We simply fit each model to the training set with the default settings that its function comes with. Once we’ve identified the model we’d like to use, we optimize its hyperparameters for the train set via cross-validation.
By doing this, we identify the best model for the problem.
Here, we optimize the hyperparameters of the chosen model and fit it to our train set.
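A hedged sketch of this step, assuming scikit-learn (the model family, grid values, and synthetic dataset are illustrative): hyperparameters are tuned by cross-validation on the train set only, and the held-out test set is touched just once at the end:

```python
# Hedged sketch: cross-validated hyperparameter search on the train set,
# followed by a single evaluation on the held-out test set.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},  # illustrative grid
    cv=5,
)
search.fit(X_train, y_train)  # the test set is never seen here
print("best C:", search.best_params_["C"])
print("held-out accuracy:", round(search.score(X_test, y_test), 3))
```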
At this stage, we observe the results of the model for the first time. This result can be used to feed insight back into the design choice and the parameters chosen, such as the feature engineering method, the number of features left in the feature selection, and even the preprocessing scheme.
Important note
Note that when feeding back insights from the results of the train set to the design of the solution, you are risking overfitting the train set. You’ll know whether you are by the gap between the results on the train set and the results on the test set.
While a gap is expected between these results in favor of the train results, a large gap should be treated as an alarm that the design isn’t optimal. In such cases, the design should be redone with systematic code-based parameters to ensure fair choices are made. It is possible to even carve out another semi-held-out set from the train set, often referred to as the validation set.
That’s it!
Now that the design has been optimized and we are confident that it suits our objective, we can apply it to our held-out set and observe the test results. These results are our most objective forecast of how well the system would do in the real world.
As mentioned previously, we should avoid letting these results impact our design choices.
In this chapter, we embarked on a comprehensive exploration of text classification, an indispensable aspect of NLP and ML. We delved into various types of text classification tasks, each presenting unique challenges and opportunities. This foundational understanding sets the stage for effectively tackling a broad range of applications, from sentiment analysis to spam detection.
We walked through the role of N-grams in capturing local context and word sequences within text, thereby enhancing the feature set used for classification tasks. We also illuminated the power of the TF-IDF method, the role of Word2Vec in text classification, and popular architectures such as CBOW and skip-gram, giving you a deep understanding of their mechanics.
Then, we introduced topic modeling and examined how popular algorithms such as LDA can be applied to text classification.
Lastly, we introduced a professional paradigm for leading an NLP-ML project in a business or research setting. We discussed the objectives and the project design aspect, and then dove into the system design. We implemented a real-world example in code and experimented with this.
In essence, this chapter has aimed to equip you with a holistic understanding of text classification and topic modeling by touching on the key concepts, methodologies, and techniques in the field. The knowledge and skills imparted will enable you to effectively approach and solve real-world text classification problems.
In the next chapter, we will introduce advanced methods for text classification. We will review deep learning methods such as language models, discuss their theory and design, and present a hands-on system design in code.
In this chapter, we delve into the realm of deep learning (DL) and its application in natural language processing (NLP), specifically focusing on groundbreaking transformer-based models such as Bidirectional Encoder Representations from Transformers (BERT) and the Generative Pre-trained Transformer (GPT). We begin by introducing the fundamentals of DL, elucidating its powerful capability to learn intricate patterns from large amounts of data, making it the cornerstone of state-of-the-art NLP systems.
Following this, we delve into transformers, a novel architecture that has revolutionized NLP by offering a more effective method of handling sequence data compared to traditional recurrent neural networks (RNNs) and convolutional neural networks (CNNs). We unpack the transformer’s unique characteristics, including its attention mechanisms, which allow it to focus on different parts of the input sequence to better understand the context.
Then, we turn our attention to BERT and GPT, transformer-based language models that leverage these strengths to create highly nuanced language representations. We provide a detailed breakdown of the BERT architecture, discussing its innovative use of bidirectional training to generate contextually rich word embeddings. We will demystify the inner workings of BERT and explore its pretraining process, which leverages a large corpus of text to learn language semantics.
Finally, we discuss how BERT can be fine-tuned for specific tasks, such as text classification. We walk you through the steps, from data preprocessing and model configuration to training and evaluation, providing a hands-on understanding of how to leverage BERT’s power for text classification.
This chapter provides a thorough exploration of DL in NLP, moving from foundational concepts to practical applications, equipping you with the knowledge to harness the capabilities of BERT and transformer models for your text classification tasks.
The following topics are covered in this chapter:
To successfully navigate through this chapter, certain technical prerequisites are necessary, as follows:
These prerequisites are intended to equip you with the necessary background to understand and implement the concepts discussed in the chapter. With these in place, you should be well-prepared to delve into the fascinating world of DL for text classification using BERT.
In this part, we explain what neural networks and deep neural networks are, what the motivation for using them is, and the different types (architectures) of DL models.
Neural networks are a subfield of artificial intelligence (AI) and ML that focuses on algorithms inspired by the structure and function of the brain. The approach is also known as “deep” learning because these neural networks often consist of many stacked layers, creating a deep architecture.
These DL models are capable of “learning” from large volumes of complex, high-dimensional, and unstructured data. The term “learning” refers to the model’s ability to automatically learn and improve from experience without being explicitly programmed for any one particular task.
DL can be supervised, semi-supervised, or unsupervised. It’s used in numerous applications, including NLP, speech recognition, image recognition, and even playing games. The models can identify patterns and make data-driven predictions or decisions.
One of the critical advantages of DL is its ability to process and model data of various types, including text, images, sound, and more. This versatility has led to a vast range of applications, from self-driving cars to sophisticated web search algorithms and highly responsive speech recognition systems.
It’s worth noting that DL, despite its high potential, also requires significant computational power and large amounts of high-quality data to train effectively, which can be a challenge.
In essence, DL is a powerful and transformative technology that is at the forefront of many of today’s technological advancements.
Neural networks are used for a variety of reasons in the field of ML and artificial intelligence. Here are some of the key motivations:
Additionally, neural networks are extensively used in NLP tasks due to several reasons. Here are some of the primary motivations:
Similarly, neural networks can learn to perform NLP tasks from raw text data without the need for manual feature extraction. This is a big advantage in NLP, where creating hand-engineered features can be challenging and time-consuming.
Despite these advantages, it’s worth noting that neural networks also have their challenges, including their “black box” nature, which makes their decision-making process difficult to interpret, and their need for large amounts of data and computational resources for training. However, the benefits they provide in terms of performance and their ability to learn from raw text data and model complex relationships make them a go-to choice for many NLP tasks.
A neural network consists of multiple layers of interconnected nodes, or “neurons,” each of which performs a simple computation on the data it receives, passing its output to the neurons of the next layer. Each connection between neurons has an associated weight that is adjusted during the learning process.
The architecture of a basic neural network consists of three types of layers, as shown in Figure 6.1:
Figure 6.1 – Basic architecture of neural networks
In the following list, we explain each layer of the model in more detail:
The neurons in the network are interconnected. The weights of these connections, which are initially set to random values, represent what the network has learned once it has been trained on data.
During the training process, an algorithm such as backpropagation is used to adjust the weights of the connections in the network in response to the difference between the network’s output and the desired output. This process is repeated many times, and the network gradually improves its performance on the training data.
To provide a simple visual idea, imagine three sets of circles (representing neurons) arranged in columns (representing layers). The first column is the input layer, the last column is the output layer and any columns in between are the hidden layers. Then, imagine lines connecting every circle in each column to every circle in the next column, representing the weighted connections between neurons. That’s a basic visual representation of a neural network.
In the next part, we are going to describe the common terms related to neural networks.
In the following subsections, we'll look at some of the most commonly used terms in the context of neural networks.
This is the basic unit of computation in a neural network; typically, a simple computation involves inputs, weights, a bias, and an activation function. A neuron, also known as a node or unit, is a fundamental element in a neural network. It receives input from some other nodes or from an external source if the neuron is in the input layer. The neuron then computes an output based on this input.
Each input has an associated weight (w), which is assigned based on its relative importance to other inputs. The neuron applies a weight to the inputs, sums them up, and then applies an activation function to the sum plus a bias value (b).
Here’s a step-by-step breakdown:

1. Each input x_i is multiplied by its associated weight w_i.
2. The weighted inputs are summed and the bias is added: z = Σ w_i x_i + b.
3. An activation function f is applied to produce the neuron’s output: a = f(z).
The output of the neuron is the result of the activation function. It serves as the input to the neurons in the next layer of the network.
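This computation can be sketched in a few lines of Python (the input values, weights, and bias below are made-up numbers, and sigmoid is chosen as the activation for illustration):

```python
# Minimal sketch of a single neuron's forward pass: weighted sum of inputs,
# plus a bias, passed through an activation function (sigmoid here).
import math

def neuron(inputs, weights, bias):
    z = sum(w * x for w, x in zip(weights, inputs)) + bias  # linear combination
    return 1.0 / (1.0 + math.exp(-z))                       # sigmoid activation

out = neuron(inputs=[1.0, 2.0], weights=[0.5, -0.25], bias=0.1)
print(round(out, 4))  # ~0.525 with these toy numbers; feeds the next layer
```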
The weights and bias in the neuron are learnable parameters. In other words, their values are learned over time as the neural network is trained on data:
The function (in each neuron) that determines the output a neuron should produce given its input is called an activation function. Common examples include sigmoid, ReLU, and tanh.
Here are some of the most common types of activation functions:
However, it has two major drawbacks: the vanishing gradients problem (gradients are very small for large positive or negative inputs, which can slow down learning during backpropagation) and outputs that are not zero-centered.
It also suffers from the vanishing gradients problem, as does the sigmoid function.
In other words, the activation is simply the input if the input is positive; otherwise, it’s zero.
It doesn’t activate all the neurons at the same time, meaning that the neurons will only be deactivated if the output of the linear transformation is less than 0. This makes the network sparse and efficient. However, ReLU units can be fragile during training and can “die” (they stop learning completely) if a large gradient flows through them.
This allows the function to “leak” some information when the input is negative and helps to mitigate the dying ReLU problem.
f(x) = x if x > 0, else α(exp(x) − 1)

Here, alpha (α) is a constant that defines the smoothness of the function for negative inputs. ELU tends to converge the cost toward zero faster and produce more accurate results. However, it can be slower to compute because of the exponential operation.
The denominator normalizes the probabilities so that they sum to 1 across all classes. The softmax function is also used in multinomial logistic regression.
Each of these activation functions has pros and cons, and the choice of activation function can depend on the specific application and context of the problem at hand.
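For reference, here is a plain-Python sketch of the activation functions discussed above (the default slope for leaky ReLU and α = 1 for ELU are common conventions, not requirements):

```python
# Reference implementations of the activation functions discussed above.
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))       # squashes to (0, 1)

def tanh(x):
    return math.tanh(x)                     # squashes to (-1, 1), zero-centered

def relu(x):
    return max(0.0, x)                      # zero for negative inputs

def leaky_relu(x, slope=0.01):
    return x if x > 0 else slope * x        # small "leak" for negative inputs

def elu(x, alpha=1.0):
    return x if x > 0 else alpha * (math.exp(x) - 1.0)

def softmax(xs):
    m = max(xs)                             # subtract max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]            # probabilities summing to 1

print(relu(-3.0), leaky_relu(-3.0), round(elu(-3.0), 4))
print(softmax([1.0, 2.0, 3.0]))
```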
A set of neurons that process signals at the same level of abstraction. The first layer is the input layer, the last layer is the output layer, and all layers in between are called hidden layers.
In the context of training a neural network, an epoch is a term used to denote one complete pass through the entire training dataset. During an epoch, the neural network’s weights are updated in an attempt to minimize the loss function.
The number of epochs hyperparameter sets how many times the deep learning algorithm processes the entire training dataset. Too many epochs can cause overfitting, where the model performs well on training data but poorly on new data. Conversely, training for too few epochs may mean the model is underfitting—it could improve with further training.
It’s also important to note that the concept of an epoch is more relevant in the batch and mini-batch variants of gradient descent. In stochastic gradient descent, the model’s weights are updated after seeing each individual example, so the concept of an epoch is less straightforward.
The number of training instances used in one iteration. Batch size refers to the number of training examples used in one iteration.
When you start training a neural network, you have a couple of options for how you feed your data into the model:
The batch size can significantly impact the learning process. Larger batch sizes result in faster progress in training but don’t always converge as fast. Smaller batch sizes update the model frequently but the progress in training is slower.
Moreover, smaller batch sizes have a regularizing effect and can help the model generalize better, leading to better performance on unseen data. However, using a batch size that is too small can lead to unstable training, less accurate estimates of the gradient, and, ultimately, a model with worse performance.
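The interplay of epochs and batch size can be sketched as follows (the dataset and sizes are toy values; each batch triggers one simulated weight update):

```python
# Sketch: one epoch = one full pass over the training set; the batch size
# fixes how many examples contribute to each weight update.
def iterate_minibatches(data, batch_size):
    for start in range(0, len(data), batch_size):
        yield data[start:start + batch_size]

data = list(range(10))          # 10 training examples
n_epochs, batch_size = 3, 4
updates = 0
for epoch in range(n_epochs):
    for batch in iterate_minibatches(data, batch_size):
        updates += 1            # one (simulated) weight update per batch
print(updates)                  # ceil(10 / 4) = 3 batches per epoch -> 9 updates
```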
Choosing the right batch size is a matter of trial and error and depends on the specific problem and the computational resources at hand.
Let’s move on to the architecture of different neural networks next.
Neural networks come in various types, each with a specific architecture suited to a different kind of task. The following list contains general descriptions of some of the most common types:
Figure 6.2 – Feedforward neural network
Figure 6.3 – Multilayer perceptron
Figure 6.4 – Convolutional neural network
Figure 6.5 – Recurrent neural network
Figure 6.6 – Autoencoder architecture
Figure 6.7 – Generative adversarial network in computer vision
These are just a few examples of neural network architectures, and many variations and combinations exist. The architecture you choose for a task will depend on the specific requirements and constraints of your task.
Training neural networks is a complex task and comes with challenges during the training, such as local minima and vanishing/exploding gradients, as well as computational costs and interpretability. All challenges are explained in detail in the following points:
These challenges make training neural networks a non-trivial task, often requiring a combination of technical expertise, computational resources, and trial and error.
A language model is a statistical model in NLP that is designed to learn and understand the structure of human language. More specifically, it is a probabilistic model that is trained to estimate the likelihood of words when provided with a given word scenario. For instance, a language model could be trained to predict the next word in a sentence, given the previous words.
Language models are fundamental to many NLP tasks. They are used in machine translation, speech recognition, part-of-speech tagging, and named entity recognition, among other things. More recently, they have been used to create conversational AI models such as chatbots and personal assistants and to generate human-like text.
Traditional language models were often based on explicit statistical methods, such as n-gram models, which consider only the previous n words when predicting the next word, or hidden Markov models (HMMs).
More recently, neural networks have become popular for creating language models, leading to the rise of neural language models. These models use the power of neural networks to consider the context of each word when making predictions, resulting in higher accuracy and fluency. Examples of neural language models include RNNs, the transformer model, and various transformer-based architectures such as BERT and GPT.
Language models are essential for understanding, generating, and interpreting human language in a computational setting, and they play a vital role in many applications of NLP.
Here are several motivations for using language models:
All these motivations stem from a central theme: language models help machines understand and generate human language more effectively, which is crucial for many applications in today’s data-driven world.
In the following section, we introduce the different types of learning and then explain how one can use self-supervised learning to train language models.
Semi-supervised learning is a type of ML approach that utilizes both labeled and unlabeled data for training. It is particularly useful when you have a small amount of labeled data and a large amount of unlabeled data. The strategy here is to use the labeled data to train an initial model and then use this model to predict labels for the unlabeled data. The model is then retrained using the newly labeled data, improving its accuracy in the process.
Unsupervised learning, on the other hand, involves training models entirely on unlabeled data. The goal here is to find underlying patterns or structures in the data. Unsupervised learning includes techniques such as clustering (where the aim is to group similar instances together) and dimensionality reduction (where the aim is to simplify the data without losing too much information).
Self-supervised learning is a form of unsupervised learning where the data provides the supervision. In other words, the model learns to predict certain parts of the input data from other parts of the same input data. It does not require explicit labels provided by humans, hence the term “self-supervised.”
In the context of language models, self-supervision is typically implemented by predicting parts of a sentence when given other parts. For example, given the sentence “The cat is on the __,” the model would be trained to predict the missing word (“mat,” in this case).
Let’s look at a couple of popular self-supervised learning strategies for training language models next.
This strategy, used in the training of BERT, randomly masks some percentage of the input tokens and tasks the model with predicting the masked words based on the context provided by the unmasked words. For instance, in the sentence “The cat is on the mat,” we could mask “cat,” and the model’s job would be to predict this word. Please note that more than one word can also be masked.
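To make the objective tangible, here is a toy count-based illustration (this is not BERT — the tiny corpus and the two-word left-context window are assumptions made for the example): the masked token is predicted from its surrounding context.

```python
# Toy illustration of masked-word prediction: pick the token that most often
# follows the same two-word left context in a tiny, made-up corpus.
from collections import Counter

corpus = [
    "the cat is on the mat",
    "the dog is on the mat",
    "the cat is on the sofa",
]

def predict_masked(sentence, corpus):
    toks = sentence.split()
    pos = toks.index("[MASK]")
    context = tuple(toks[max(0, pos - 2):pos])   # two words of left context
    counts = Counter()
    for line in corpus:
        words = line.split()
        for i, w in enumerate(words):
            if tuple(words[max(0, i - 2):i]) == context:
                counts[w] += 1
    return counts.most_common(1)[0][0]

print(predict_masked("the cat is on the [MASK]", corpus))
```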
Mathematically, the objective of an MLM is to maximize the following likelihood (in log form, summed over all masked positions i):

P(w_i | w_{-i}; θ)

where w_i is a masked word, w_{-i} denotes the non-masked words, and θ represents the model parameters.
In autoregressive language modeling, which is used in models such as GPT, the model predicts the next word in a sentence given all the preceding words. It’s trained to maximize the likelihood of a word given its previous words in the sentence.
The objective of an autoregressive language model is to maximize

P(w_i | w_1, …, w_{i−1}; θ)

where w_i is the current word, w_1, …, w_{i−1} are the previous words, and θ represents the model parameters.
These strategies enable language models to obtain a rich understanding of language syntax and semantics directly from raw text without the need for explicit labels. The models can then be fine-tuned for various tasks such as text classification, sentiment analysis, and more, leveraging the language understanding gained from the self-supervised pretraining phase.
Transfer learning is an ML technique where a pretrained model is reused as the starting point for a different but related problem. Compared to traditional ML approaches, where you start with initializing your model with random weights, transfer learning has the advantage of kick-starting the learning process from patterns that have been learned from a related task, which can both speed up the training process and improve the performance of the model, especially when you have limited labeled training data.
In transfer learning, a model is typically trained on a large-scale task, and then parts of the model are used as a starting point for another task. The large-scale task is often chosen to be broad enough that the learned representations are useful for many different tasks. This process works particularly well when the input data for both tasks are of the same type and the tasks are related.
There are several ways to apply transfer learning, and the best approach can depend on how much data you have for your task and how similar your task is to the original task the model was trained on.
The pretrained model acts as a feature extractor. You remove the last layer or several layers of the model, leaving the rest of the network intact. Then, you pass your data through this truncated model and use the output as input to a new, smaller model that is trained for your specific task.
You use the pretrained model as a starting point and update all or some of the model’s parameters for your new task. In other words, you continue the training where it left off, allowing the model to adjust from generic feature extraction to features more specific to your task. Often, a lower learning rate is used during fine-tuning to avoid overwriting the prelearned features entirely during training.
Transfer learning is a powerful technique that can be used to improve the performance of ML models. It is particularly useful for tasks where there are limited labeled data available. It is commonly used in DL applications. For instance, it’s almost a standard in image classification problems where pretrained models on ImageNet, a large-scale annotated image dataset (ResNet, VGG, Inception, and so on), are used as the starting point. The features learned by these models are generic for image classification and can be fine-tuned on a specific image classification task with a smaller amount of data.
Here are some examples of how transfer learning can be used:
Similarly, in natural language processing, large pretrained models, such as BERT or GPT, are often used as the starting point for a wide range of tasks. These models are pretrained on a large corpus of text and learn a rich representation of language that can be fine-tuned for specific tasks such as text classification, sentiment analysis, question answering, and more.
Transformers are a type of neural network architecture that was introduced in the paper Attention is All You Need by Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin (Advances in Neural Information Processing Systems 30, 2017). They have been very influential in the field of NLP and have formed the basis for state-of-the-art models such as BERT and GPT.
The key innovation in transformers is the self-attention mechanism, which allows the model to weigh the relevance of each word in the input when producing an output, thereby considering the context of each word. This is unlike previous models such as RNNs or LSTMs, which process the input sequentially and, therefore, have a harder time capturing the long-range dependencies between words.
A transformer is composed of an encoder and a decoder, both of which are made up of several identical layers, as shown in Figure 6.8. Each layer in the encoder contains two sub-layers: a self-attention mechanism and a position-wise fully connected feedforward network. A residual connection is employed around each of the two sub-layers, followed by layer normalization:
Figure 6.8 – Self-attention mechanism
Similarly, each layer in the decoder has three sub-layers. The first is a self-attention layer, the second is a cross-attention layer that attends to the output of the encoder stack, and the third is a position-wise fully connected feedforward network. Like the encoder, each of these sub-layers has a residual connection around it, followed by layer normalization. Note that the figure shows just one head; in practice, multiple heads work in parallel (N heads).
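As a rough illustration of the "Add & Norm" step, here is our own simplification, which leaves out the learned gain and bias of a real layer-normalization layer:

```python
import math

def layer_norm(x, eps=1e-5):
    # Normalize a vector to zero mean and (near) unit variance.
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def add_and_norm(x, sublayer_out):
    # Residual connection followed by layer normalization, as applied
    # around each sub-layer of the encoder and decoder.
    return layer_norm([a + b for a, b in zip(x, sublayer_out)])
```

The residual addition lets the gradient flow past each sub-layer unchanged, while the normalization keeps activations in a stable range during training.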
The self-attention mechanism, or scaled dot-product attention, calculates the relevance of each word in the sequence to the current word being processed. The input to the self-attention layer is a sequence of word embeddings, each of which is split into a query (Q), a key (K), and a value (V) using separately learned linear transformations.
The attention score for each word is then calculated as follows:

Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V
Where d_k is the dimensionality of the queries and keys, which is used to scale the dot product to prevent it from growing too large. The softmax operation ensures that the attention scores are normalized and sum to 1. These scores represent the weight given to each word’s value when producing the output for the current word.
The output of the self-attention layer is a new sequence of vectors, where the output for each word is a weighted sum of all the input values, with the weights determined by the attention scores.
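The computation above can be sketched in plain Python. This is a didactic version of our own that operates on lists of vectors; real implementations are batched matrix operations on a GPU.

```python
import math

def softmax(xs):
    m = max(xs)                      # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(Q, K, V):
    # Q, K: seq_len rows of dimension d_k; V: seq_len rows of dimension d_v.
    d_k = len(K[0])
    output = []
    for q in Q:
        # Relevance of every position to the current query word
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in K]
        weights = softmax(scores)    # normalized, sums to 1
        # Weighted sum of the value vectors
        output.append([sum(w * v[j] for w, v in zip(weights, V))
                       for j in range(len(V[0]))])
    return output
```

Each output row is a convex combination of the value vectors, with the weights determined by how well the query matches each key.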
Since the self-attention mechanism does not take into account the position of the words in the sequence, the transformer adds a positional encoding to the input embeddings at the bottom of the encoder and decoder stacks. This encoding is a fixed function of the position and allows the model to learn to use the order of the words.
In the original transformer paper, positional encoding is a sinusoidal function of the position and the dimension, although learned positional encodings have also been used effectively.
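The sinusoidal scheme can be written directly. In this minimal sketch of our own, even dimensions use a sine and odd dimensions a cosine of the position, scaled by a frequency that decreases with the dimension index:

```python
import math

def positional_encoding(seq_len, d_model):
    # PE[pos, 2i]   = sin(pos / 10000**(2i / d_model))
    # PE[pos, 2i+1] = cos(pos / 10000**(2i / d_model))
    pe = []
    for pos in range(seq_len):
        row = []
        for dim in range(d_model):
            freq = 10000 ** ((dim // 2) * 2 / d_model)
            angle = pos / freq
            row.append(math.sin(angle) if dim % 2 == 0 else math.cos(angle))
        pe.append(row)
    return pe
```

These rows are simply added to the input embeddings, giving every position a distinct, smoothly varying signature the model can learn to exploit.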
Since their introduction, transformers have been used to achieve state-of-the-art results on a wide range of NLP tasks, including machine translation, text summarization, sentiment analysis, and more. They have also been adapted for other domains, such as computer vision and reinforcement learning.
The introduction of transformers has led to a shift in the NLP field towards pretraining large transformer models on a large corpus of text and then fine-tuning them on specific tasks, which is an effective form of transfer learning. This approach has been used in models such as BERT, GPT-2, GPT-3, and GPT-4.
Large language models are a class of ML models that have been trained on a broad range of internet text.
The term “large” in “large language models” refers to the number of parameters that these models have. For example, GPT-3 has 175 billion parameters. These models are trained using self-supervised learning on a large corpus of text, which means they predict the next word in a sentence (such as GPT) or a word based on surrounding words (such as BERT, which is also trained to predict whether a pair of sentences is sequential). Because they are exposed to such a large amount of text, these models learn grammar, facts about the world, reasoning abilities, and also biases in the data they’re trained on.
These models are transformer-based, meaning they leverage the transformer architecture, which uses self-attention mechanisms to weigh the importance of words in input data. This architecture allows these models to process long-range dependencies in text, making them very effective for a wide range of NLP tasks.
Large language models can be fine-tuned on specific tasks to achieve high performance. Fine-tuning involves additional training on a smaller, task-specific dataset and allows the model to adapt its general language understanding abilities to the specifics of the task. This approach has been used to achieve state-of-the-art results on many NLP benchmarks.
While large language models have demonstrated impressive abilities, they also raise important challenges. For example, because they’re trained on internet text, they can reproduce and amplify biases present in the data. They can also generate outputs that are harmful or misleading. Additionally, due to their size, these models require significant computational resources to train and deploy, which raises issues around cost and environmental impact.
Despite these challenges, large language models represent a significant advance in the field of AI and are a powerful tool for a wide range of applications, including translation, summarization, content creation, question answering, and more.
Training large language models is a complex and resource-intensive task that poses several challenges, among them the massive computational cost, the quality and biases of the training data, and the environmental impact of training at this scale.
Despite these challenges, progress continues in the field of large language models. Researchers are developing new strategies to mitigate these issues and to train large models more effectively and responsibly.
Here, we are going to explain two popular architectures of language models, BERT and GPT, in detail.
BERT, which we mentioned already and will now expand on, is a transformer-based ML technique for NLP tasks. It was developed by Google and introduced in a paper by Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova titled BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (arXiv preprint arXiv:1810.04805, 2018).
BERT is designed to pretrain deep bidirectional representations from unlabeled text by jointly conditioning on both left and right contexts in all layers. This is in contrast to previous methods, such as GPT and ELMo, which pretrain text representations from only the left context or from left and right contexts separately. This bidirectionality allows BERT to understand the context and the semantic meaning of a word more accurately.
BERT is based on the transformer model architecture, which is shown in Figure 6.8, originally introduced by Vaswani et al. in the paper Attention is All You Need. The model architecture consists of stacked self-attention and point-wise fully connected layers.
BERT comes in two sizes: BERT Base and BERT Large. BERT Base is composed of 12 transformer layers, each with 12 self-attention heads, and a total of 110 million parameters. BERT Large is much bigger and has 24 transformer layers, each with 16 self-attention heads, for a total of 340 million parameters.
BERT’s training process involves two steps: pretraining and fine-tuning.
The very first step in training or using a language model is to create or load its dictionary. We usually use a tokenizer to achieve this goal.
In order to use the language models efficiently, we need to use a tokenizer that converts the input text into a limited number of tokens. Subword tokenization algorithms, such as byte pair encoding (BPE), unigram language model (ULM), and WordPiece, split words into smaller subword units. This is useful for handling out-of-vocabulary words and allows the model to learn meaningful representations for subword parts that often carry semantic meaning.
The BERT tokenizer is a critical component of the BERT model, performing the initial preprocessing of text data necessary for input into the model. BERT uses WordPiece tokenization, a subword tokenization algorithm that breaks words into smaller parts, allowing BERT to handle out-of-vocabulary words, reduce the size of the vocabulary, and deal with the richness and diversity of languages.
At a high level, the BERT tokenizer first splits the text into words, further breaks those words into WordPieces where necessary, adds special tokens such as [CLS] and [SEP], and finally converts the tokens into vocabulary IDs.
For example, the word “unhappiness” might be broken down into two WordPieces: “un” and “##happiness”. The “##” symbol is used to denote sub-words that are part of a larger word and not a whole word on their own.
So, in summary, the BERT tokenizer works by first tokenizing the text into words, then further breaking these words down into WordPieces (if necessary), adding special tokens, and finally converting these tokens into IDs. This process allows the model to understand and generate meaningful representations for a wide variety of words and sub-words, contributing to BERT’s powerful performance on various NLP tasks.
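The WordPiece step itself can be sketched as a greedy longest-match-first loop. The following is our own toy illustration with a hypothetical five-entry vocabulary, not the real 30,000-token BERT vocabulary:

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    # Greedily take the longest prefix found in the vocabulary;
    # continuation pieces are looked up with the "##" prefix.
    tokens, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while start < end:
            sub = word[start:end]
            if start > 0:
                sub = "##" + sub
            if sub in vocab:
                piece = sub
                break
            end -= 1
        if piece is None:
            return [unk]          # like BERT, give up on the whole word
        tokens.append(piece)
        start = end
    return tokens

toy_vocab = {"un", "happy", "##happy", "##happiness", "##ness"}
print(wordpiece_tokenize("unhappiness", toy_vocab))  # ['un', '##happiness']
```

With this toy vocabulary, "unhappiness" splits into "un" followed by "##happiness", matching the example above; a string with no matching pieces maps to the unknown token.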
During pretraining, BERT was trained on a large corpus of text (the entire English Wikipedia and BooksCorpus were used in the original paper). The model was trained to predict masked words in a sentence (masked language modeling) and to distinguish whether two sentences come in order in the text (next sentence prediction).
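A stripped-down sketch of the masking step, of our own making: real BERT masks about 15% of tokens, and of the chosen positions only 80% actually become [MASK] (the rest are kept or randomized), a refinement omitted here.

```python
import random

def mask_tokens(tokens, mask_rate=0.15, seed=0):
    # Pick ~mask_rate of the positions, hide them behind [MASK],
    # and keep (position, original token) pairs as training labels.
    rng = random.Random(seed)
    masked = list(tokens)
    n = max(1, round(mask_rate * len(tokens)))
    labels = []
    for pos in sorted(rng.sample(range(len(tokens)), n)):
        labels.append((pos, masked[pos]))
        masked[pos] = "[MASK]"
    return masked, labels

sentence = "the model learns to predict the hidden words".split()
masked, labels = mask_tokens(sentence)
```

The model is then trained to recover the original tokens at the masked positions, which forces it to use the context on both sides.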
After pretraining, BERT can be fine-tuned on a specific task with a significantly smaller amount of training data. Fine-tuning involves adding an additional output layer to BERT and training the entire model end-to-end on the specific task. This approach has been shown to achieve state-of-the-art results on a wide range of NLP tasks, including question answering, named entity recognition, sentiment analysis, and more.
BERT’s design and its pretraining/fine-tuning approach revolutionized the field of NLP and have led to a shift toward training large models on a broad range of data and then fine-tuning them on specific tasks.
As mentioned, BERT has been pretrained on a large corpus of text data, and the learned representations can be fine-tuned for specific tasks, including text classification. At a high level, fine-tuning for classification involves preprocessing and tokenizing the labeled text, loading the pretrained BERT model, adding a classification layer, training the model on the labeled data, and evaluating the result.
Important note
Note that working with BERT requires considerable computational resources, as the model has a large number of parameters. A GPU is typically recommended for fine-tuning and applying BERT models. There are models that are lighter than BERT with slightly lower performance, such as DistilBERT, which we can use when constrained by compute or memory resources. Additionally, BERT can process at most 512 tokens, which limits the length of our input text. If we want to process longer text, Longformer or BigBird are good choices. What we explain here also works for similar language models, such as RoBERTa, XLNet, and so on.
In summary, fine-tuning BERT for text classification involves preprocessing the input data, loading the pretrained BERT model, adding a classification layer, fine-tuning the model on the labeled data, and then evaluating and applying the model.
We will demonstrate the preceding paradigm of fine-tuning BERT and then apply it at the end of this chapter. You will have the opportunity to employ it firsthand and adjust it to your needs.
GPT-3, short for Generative Pre-trained Transformer 3, is an autoregressive language model developed by OpenAI that uses DL techniques to generate human-like text. It is the third version of the GPT series. The GPT versions that followed it, GPT-3.5 and GPT-4, will be covered in the next chapter, as we expand on large language models.
GPT-3 extends the transformer model architecture used by its predecessors. The architecture is based on a transformer model that uses layers of transformer blocks, where each block is composed of self-attention and feedforward neural network layers.
GPT-3 is massive compared to the previous versions: it consists of 175 billion parameters. These parameters are learned during the training phase, in which the model learns to predict the next word in a sequence of words.
GPT-3’s transformer model is designed to process sequences of data (in this case, sequences of words or tokens in text), making it well-suited for language tasks. It processes input data sequentially from left to right and generates predictions for the next item in the sequence. This is the difference between BERT and GPT, where, in BERT, words from both sides are used to predict masked words, but in GPT, just the previous words are used for prediction, which makes it a good choice for generative tasks.
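The directionality difference can be pictured as an attention mask. In this sketch of our own, a GPT-style causal mask is lower-triangular, while a BERT-style encoder lets every position see every other:

```python
def causal_mask(seq_len):
    # mask[i][j] is True when position i may attend to position j.
    # GPT-style decoding sees only the current and earlier tokens.
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]

def bidirectional_mask(seq_len):
    # BERT-style encoding: every position attends to every position.
    return [[True] * seq_len for _ in range(seq_len)]
```

In practice, the disallowed positions are set to a large negative value before the softmax, so they receive (near) zero attention weight.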
Similar to BERT and other transformer-based models, GPT-3 also involves a two-step process: pretraining and fine-tuning.
In this phase, GPT-3 is trained on a large corpus of text data. It learns to predict the next word in a sentence. However, unlike BERT, which uses a bidirectional context for prediction, GPT-3 only uses the left context (i.e., the previous words in the sentence).
After the pretraining phase, GPT-3 can be fine-tuned on a specific task using a smaller amount of task-specific training data. This could be any NLP task, such as text completion, translation, summarization, question answering, and so on.
One of the impressive features of GPT-3 is its capability to perform few-shot learning. When given a task and a few examples of that task, GPT-3 can often learn to perform the task accurately.
In the zero-shot setting, the model is given a task without any prior examples. In the one-shot setting, it’s given one example, and in the few-shot setting, it’s given a few examples to learn from.
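In practice, these settings differ only in how many worked examples are placed in the prompt. Here is a sketch of the prompt construction; the task text and the example pairs are illustrative, not taken from GPT-3's documentation:

```python
def build_prompt(task, examples, query):
    # examples == []  -> zero-shot
    # one example     -> one-shot
    # a few examples  -> few-shot
    lines = [task]
    for x, y in examples:
        lines.append(f"Input: {x}\nOutput: {y}")
    lines.append(f"Input: {query}\nOutput:")
    return "\n\n".join(lines)

few_shot = build_prompt(
    "Translate English to French.",
    [("cheese", "fromage"), ("bread", "pain")],
    "milk",
)
```

The model then continues the text after the final "Output:", using the in-prompt examples as its only task-specific supervision.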
Despite its impressive capabilities, GPT-3 also presents some challenges. Due to its large size, it requires substantial computational resources to train. It can sometimes generate incorrect or nonsensical responses, and it can reflect biases present in the training data. It also struggles with tasks that require a deep understanding of the world or common sense reasoning beyond what can be learned from text.
In this section, we are going to work on a real-world problem and see how we can use an NLP pipeline to solve it. The code for this part is shared as a Google Colab notebook at Ch6_Text_Classification_DL.ipynb.
In this scenario, we are in the healthcare sector. Our objective is to develop a general medical knowledge engine that is very up to date with recent findings in the world of healthcare.
The CTO derives several technical objectives from the business objective. One objective is for the ML team: given the growing collection of conclusions that correspond to medical publications, identify the ones that represent advice. This will allow us to identify the medical advice that stems from the underlying research.
Let’s review the parts of the pipeline, as depicted in Figure 6.9:
Figure 6.9 – The structure of a typical exploration and model pipeline
Notice how this design is different from the design we saw in Figure 5.2. There, the exploration and evaluation parts leverage the same feature engineering technique that is later used by the ML models. Here, with LMs, feature engineering is not a part of the preparation for the modeling. The pretrained model, and particularly the tokenizer, performs feature engineering, which yields very different and less interpretable features than the binary, BoW, or TF-IDF features.
Note
Code parts: From “Settings” through “Generating Results of the Traditional ML Models.”
These parts are identical in their nature to the analog parts discussed in Chapter 5. The only differences relate to the differences in the data.
In this part of the code, we employ a deep learning language model.
When looking to apply transfer learning via LMs and fine-tune them per our objective and data, there are several stacks to choose from. The ones that stand out the most are Google's TensorFlow and Meta's PyTorch. A package called Transformers was built as a wrapper around these stacks to allow for a simpler implementation of the code. In this example, we leverage the simplicity and richness of the Transformers models.
It is worth highlighting the company that built and supports the Transformers package: Hugging Face. Hugging Face has created an entire ecosystem around the collection and sharing of free, open source DL models, including the many components that support implementing these models. The most actionable tool is the Transformers package, a Python package dedicated to picking, importing, training, and employing a large and growing set of DL models.
The code we are reviewing here provides more than just an example of ML/DL system design in the real world; it also showcases Hugging Face’s Transformers.
Here, we set the data up in a format that suits the Transformers library. The column names must be very specific.
We decided which metric we wished to optimize and plugged it into the training process. For this problem of binary classification, we optimized for accuracy and evaluated our result in comparison to the dataset’s baseline accuracy, also known as the prior.
This is the core object for training the LM in Transformers. It holds a set of predefined configurations; key ones include the learning rate, the number of training epochs, the batch size, and the evaluation and logging cadence.
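For orientation, here is a hedged sketch of what such a configuration can look like with Hugging Face's TrainingArguments. The argument values are illustrative defaults, not the notebook's exact settings, and argument names have shifted between library versions, so check the installed version's API.

```python
from transformers import TrainingArguments  # assumes the transformers package is installed

training_args = TrainingArguments(
    output_dir="finetune-out",          # where checkpoints and logs go
    learning_rate=2e-5,                 # a low LR, typical for fine-tuning
    num_train_epochs=3,
    per_device_train_batch_size=16,
    evaluation_strategy="epoch",        # evaluate after every epoch
    logging_steps=50,                   # log training metrics regularly
)
```

The object is then handed to a Trainer along with the model, the tokenized datasets, and the metric function.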
The fundamental concept around fine-tuning LMs is transfer learning. Neural networks lend themselves so well to transfer learning because one can simply strip any number of layers from the end of the structure and replace them with untrained layers that would be trained based on the underlying problem. The rest of the layers that weren’t removed and aren’t trained continue to operate exactly in the same way they did when the LM was originally trained (when it was originally built). If we replace the last layer but leave the rest of the original layers, then we could view those layers as supervised feature engineering or, conversely, as an embedding mechanism. This trait reflects the concept of transfer learning. Ideally, the model is expected to lend itself well to our underlying problem so that we will choose to keep the vast majority of the original layers, and only a small minority would be replaced and trained. In this way, a large DL model that took many weeks to be pretrained can be transferred and adapted to a new problem in minutes.
In our code, we set the model up in a way that dictates exactly which of its layers we are looking to fine-tune. This is a design choice, based on performance and also on available computation resources. One choice is to fine-tune only the last layer right before the final output, also known as the classification head. The alternative is to fine-tune all the layers. In our code, we explicitly call the model's configuration, which controls which layers are fine-tuned, so the code can be changed in any way that suits the design.
We configure the trainer to log the performance of the training in real time. It prints those logs out for us in a table so we can observe and monitor them. When the training is complete, we plot the progress of the training and the evaluation. This helps us see the relation between the evolution of the training results and the evaluation results. Since the evaluation set that the trainer uses can be viewed as a held-out set in the context of the trainer, this plot allows us to investigate underfitting and overfitting.
We reviewed the results of the training set, along with the logs that the trainer printed out. We compared them to the baseline accuracy and observed an increase in accuracy. We learned about the quality of our design by iterating over several different design choices and comparing them. That process of iterating over many sets of design parameters would be automated into code to allow for a systematic evaluation of the optimal setting. We didn’t do that in our notebook just to keep things simple in the example. Once we believed we had found the optimal setting, we could say that the process was finished.
As with the code in Chapter 5, here, too, we finished by reviewing the test results. It is worth noting the difference between the evaluation set and the test set. One could suggest that since the trainer doesn’t use the evaluation set for training, it could be used as a held-out test set, thus saving the need to exclude so many observations from training and supplying the model with more labeled data. However, while the trainer didn’t use the evaluation set, we did use it to make our design decisions. For instance, we observed the plot from the preceding section and judged which number of epochs is optimal to achieve optimal fitting. In Chapter 5, an evaluation set was used too, but we didn’t need to explicitly define it; it was carried out as a part of the K-fold cross-validation mechanism.
In this enlightening chapter, we embarked on a comprehensive exploration of DL and its remarkable application to text classification tasks through language models. We began with an overview of DL, revealing its profound ability to learn complex patterns from vast amounts of data and its indisputable role in advancing state-of-the-art NLP systems.
We then delved into the transformative world of transformer models, which have revolutionized NLP by providing an effective alternative to traditional RNNs and CNNs for processing sequence data. By unpacking the attention mechanism—a key feature in transformers—we highlighted its capacity to focus on different parts of the input sequence, hence facilitating a better understanding of context.
Our journey continued with an in-depth exploration of the BERT model. We detailed its architecture, emphasizing its pioneering use of bidirectional training to generate contextually rich word embeddings, and we highlighted its pretraining process, which learns language semantics from a large text corpus.
However, our exploration did not end there; we also introduced GPT, another transformative model that leverages the power of transformers in a slightly different way—focusing on generating human-like text. By comparing BERT and GPT, we shed light on their distinct strengths and use cases.
The chapter culminated in a practical guide on how to design and implement a text classification model using these advanced models. We walked you through all the stages of this process, from data preprocessing and model configuration to training, evaluation, and finally, making predictions on unseen data.
In essence, this chapter provided a well-rounded understanding of DL in NLP, transitioning from fundamental principles to hands-on applications. With this knowledge, you are now equipped to leverage the capabilities of transformer models, BERT, and GPT for your text classification tasks. Whether you are looking to delve further into the world of NLP or apply these skills in a practical setting, this chapter has equipped you with a firm foundation on which to build.
In this chapter, we introduced you to large language models. In the next chapter, we dive deeper into these models to learn more about them.
In this chapter, we delve deep into the intricate world of large language models (LLMs) and the underpinning mathematical concepts that fuel their performance. The advent of these models has revolutionized the field of natural language processing (NLP), offering unparalleled proficiency in understanding, generating, and interacting with human language.
LLMs are a subset of artificial intelligence (AI) models that can understand and generate human-like text. They achieve this by being trained on a diverse range of internet text, thus learning an extensive array of facts about the world. They also learn to predict what comes next in a piece of text, which enables them to generate creative, fluent, and contextually coherent sentences.
As we explore the operations of LLMs, we will introduce the key metric of perplexity, a measurement of uncertainty that is pivotal in determining the performance of these models. A lower perplexity indicates the confidence that a language model (LM) has in predicting the next word in a sequence, thus showcasing its proficiency.
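To make the metric concrete, here is a minimal sketch (an illustration of the definition, not code from this chapter) of computing perplexity from the probabilities a model assigns to each token in a sequence:

```python
import math

def perplexity(token_probs):
    """Perplexity = exp of the average negative log-probability
    that the model assigned to each token in the sequence."""
    n = len(token_probs)
    avg_neg_log_prob = -sum(math.log(p) for p in token_probs) / n
    return math.exp(avg_neg_log_prob)

# A confident model assigns high probability to each next token,
# while an uncertain one spreads its probability mass thinly.
confident = perplexity([0.9, 0.8, 0.95])
uncertain = perplexity([0.1, 0.05, 0.2])
print(confident < uncertain)  # True: lower perplexity, higher confidence
```

A model that is uniformly uncertain over a vocabulary of size V has perplexity V, which is why perplexity is often read as the model's effective branching factor.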
This chapter draws on multiple insightful publications that delve into the mathematical insights of LLMs. Some of these include A Neural Probabilistic Language Model, Attention is All You Need, and PaLM: Scaling Language Modeling with Pathways. These sources will guide us in understanding the robust mechanisms that underpin LLMs and their exceptional capabilities.
We will also explore the emerging field of reinforcement learning from human feedback (RLHF) in the context of LMs. RLHF has proven to be a powerful tool in fine-tuning the performance of LLMs, thereby leading to more accurate and meaningful generated texts.
With a comprehensive understanding of the mathematical foundations of LLMs and a deep dive into RLHF, we will gain a robust knowledge of these advanced AI systems, paving the way for future innovations and advancements in the field.
Finally, we will discuss the detailed architecture and design of recent models, such as Pathways Language Model (PaLM), Large Language Model Meta AI (LLaMA), and GPT-4.
Now, let’s look at the topics covered in this chapter:
For this chapter, you are expected to possess a solid foundation in machine learning (ML) concepts, particularly in the areas of Transformers and reinforcement learning. An understanding of Transformer-based models, which underpin many of today’s LLMs, is vital. This includes familiarity with concepts such as self-attention mechanisms, positional encoding, and the structure of encoder-decoder architectures.
Knowledge of reinforcement learning principles is also essential, as we will delve into the application of RLHF in the fine-tuning of LMs. Familiarity with concepts such as policy gradients, reward functions, and Q-learning will greatly enhance your comprehension of this content.
Lastly, coding proficiency, specifically in Python, is crucial. This is because many of the concepts will be demonstrated and explored through the lens of programming. Experience with PyTorch or TensorFlow, popular ML libraries, and Hugging Face’s Transformers library, a key resource for working with transformer models, will also be beneficial.
However, don’t be discouraged if you feel you’re lacking in some areas. This chapter aims to walk you through the complexities of these subjects, bridging any knowledge gaps along the way. So, come prepared with a mindset for learning, and let’s delve into the fascinating world of LLMs!
An LM is a type of ML model that is trained to predict the next word (or character or subword, depending on the granularity of the model) in a sequence, given the words that came before it (or in some models, the surrounding words). It’s a probabilistic model that is capable of generating text that follows a certain linguistic style or pattern.
Before the advent of Transformer-based models such as generative pretrained Transformers (GPTs) and Bidirectional Encoder Representations from Transformers (BERT), there were several other types of LMs widely used in NLP tasks. The following subsections discuss a few of them.
These are some of the simplest LMs. An n-gram model uses the (n-1) previous words to predict the nth word in a sentence. For example, in a bigram (2-gram) model, we would use the previous word to predict the next word. These models are easy to implement and computationally efficient, but they typically don’t perform as well as more complex models because they don’t capture long-range dependencies between words. Their performance also degrades as n increases, as they suffer from data sparsity issues (not having enough data to accurately estimate the probabilities for all possible n-grams).
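As a quick illustration (our own sketch, not code from the chapter), a bigram model can be estimated from raw counts using maximum likelihood:

```python
from collections import defaultdict

def train_bigram(corpus):
    """Estimate P(next | prev) from bigram counts (maximum likelihood)."""
    counts = defaultdict(lambda: defaultdict(int))
    for sentence in corpus:
        tokens = sentence.split()
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    probs = {}
    for prev, nexts in counts.items():
        total = sum(nexts.values())
        probs[prev] = {w: c / total for w, c in nexts.items()}
    return probs

model = train_bigram(["the cat sat", "the cat ran", "the dog sat"])
print(model["the"])  # 'cat' is twice as likely as 'dog' after 'the'
```

Note that any bigram never seen in training gets probability zero here, which is exactly the data sparsity problem described above; practical n-gram models add smoothing to avoid it.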
These models consider the “hidden” states that generate the observed data. In the context of language modeling, each word would be an observed state, and the “hidden” state would be some kind of linguistic feature that’s not directly observable (such as the part of speech of the word). However, like n-gram models, HMMs struggle to capture long-range dependencies between words.
These are a type of neural network where connections between nodes form a directed graph along a temporal sequence. This allows them to use their internal state (memory) to process sequences of inputs, making them ideal for language modeling. They can capture long-range dependencies between words, but they struggle with the so-called vanishing gradient problem, which makes it difficult to learn these dependencies in practice.
An LSTM network is a special kind of RNN that is designed to learn long-term dependencies. They do this by using a series of “gates” that control the flow of information in and out of the memory state of the network. LSTMs were a big step forward in the state of the art of language modeling.
These are a variation of LSTMs that use a slightly different set of gates in their architecture. They’re often simpler and faster to train than LSTMs, but whether they perform better or worse than LSTMs tends to depend on the specific task at hand.
Each of these models has its own strengths and weaknesses, and none of them are inherently better or worse than the others – it all depends on the specific task and dataset. However, Transformer-based models have generally outperformed all of these models in a wide range of tasks, leading to their current popularity in the field of NLP.
LLMs, such as GPT-3 and GPT-4, are simply LMs that are trained on a very large amount of text and have a very large number of parameters. The larger the model (in terms of parameters and training data), the more capable it is of understanding and generating complex and varied texts. Here are some key ways in which LLMs differ from smaller LMs:
Thus, we can say that LLMs are essentially scaled-up versions of smaller LMs. They're trained on more data, have more parameters, and are generally capable of producing higher-quality results, but they also require more resources to train and use. Besides that, an important advantage of LLMs is that we can train them unsupervised on a large corpus of data and then fine-tune them with a limited amount of data for different tasks.
The motivation to develop and use LLMs arises from several factors related to the capabilities of these models, and the potential benefits they can bring in diverse applications. The following subsections detail a few of these key motivations.
LLMs, when trained with sufficient data, generally demonstrate better performance compared to smaller models. They are more capable of understanding context, identifying nuances, and generating coherent and contextually relevant responses. This performance gain applies to a wide range of tasks in NLP, including text classification, named entity recognition, sentiment analysis, machine translation, question answering, and text generation. As shown in Table 7.1, the performance of BERT (one of the first well-known LLMs) and GPT is compared to that of previous models on the General Language Understanding Evaluation (GLUE) benchmark. The GLUE benchmark is a collection of diverse natural language understanding (NLU) tasks designed to evaluate the performance of models across multiple linguistic challenges. The benchmark encompasses tasks such as sentiment analysis, question answering, and textual entailment, among others. It's a widely recognized standard in the field of NLU, providing a comprehensive suite for comparing and improving language understanding models. As the table shows, BERT's performance is better across all tasks:
| Model | Average (all tasks) | Sentiment analysis | Grammatical acceptability | Similarity |
|---|---|---|---|---|
| BERT large | 82.1 | 94.9 | 60.5 | 86.5 |
| BERT base | 79.6 | 93.5 | 52.1 | 85.8 |
| OpenAI GPT | 75.1 | 91.3 | 45.4 | 80.0 |
| Pre-OpenAI state of the art (SOTA) | 74.0 | 93.2 | 35.0 | 81.0 |
| Bidirectional Long Short-Term Memory (BiLSTM) + Embeddings from Language Model (ELMo) + Attention | 71.0 | 90.4 | 36.0 | 73.3 |

Table 7.1 – Comparing different models’ performance on GLUE (this comparison is based on 2018, when BERT and GPT were released)
LLMs trained on diverse datasets can generalize better across different tasks, domains, or styles of language. They can effectively learn from the training data to identify and understand a wide range of linguistic patterns, styles, and topics. This broad generalization capability makes them versatile for various applications, from chatbots to content creation to information retrieval.
When an LM is bigger, it means it has more parameters. These parameters allow the model to capture and encode more complex relationships and nuances within the data. In other words, a bigger model can learn and retain more information from the training data. As such, it is better equipped to handle a wider array of tasks and contexts post-training. It is this increased complexity and capacity that makes bigger LMs more generalizable across different tasks. As we can see in Figure 7.1, larger LMs perform better across different tasks.
Figure 7.1 – LLMs’ performance based on their size and training
We can also see the progress in the development of LLMs over the last three years in Figure 7.2.
Figure 7.2 – LMs released between 2019 and 2023 (the publicly available models are highlighted)
However, it’s important to note that while larger models tend to be more generalizable, they also pose challenges such as increased computational requirements and the risk of overfitting. It is also essential to ensure that the training data is representative of the tasks and domains the model is expected to perform in, as models might carry over any biases present in the training data.
LLMs such as GPT-3, GPT-3.5, and GPT-4 have demonstrated impressive few-shot learning capabilities. Given a few examples (the “shots”), these models can generalize to complete similar tasks effectively. This makes adjusting and deploying these models in real-world applications more efficient. The prompts can be designed to include information for the model to refer to, such as example questions and their respective answers.
The model temporarily learns from given examples and refers to given information as an additional source. For example, when the LLM is used as a personal assistant or advisor, background information about the user can be appended to the prompt, allowing the model to “get to know you,” as it uses your personal information prompts as a reference.
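A few-shot prompt of this kind can be built by simply concatenating example input-output pairs before the new query. The reviews and labels below are invented for illustration, and no particular model API is assumed:

```python
# Hypothetical labeled examples (the "shots") prepended to the prompt.
examples = [
    ("The movie was fantastic!", "positive"),
    ("I wasted two hours of my life.", "negative"),
]
query = "The plot was engaging from start to finish."

prompt = "Classify the sentiment of each review.\n\n"
for text, label in examples:
    prompt += f"Review: {text}\nSentiment: {label}\n\n"
prompt += f"Review: {query}\nSentiment:"

print(prompt)  # the model is expected to continue with the label
```

The same pattern works for appending background information about a user: any reference material placed in the prompt is available to the model for that conversation only.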
LLMs have the advantage of understanding complex contexts due to their extensive training on a wide range of data, including various topics, literary styles, and nuances as well as their deep architecture and large parameter space. This capacity allows them to comprehend and generate appropriate responses even in complex or nuanced situations.
For example, consider a scenario where a user asks the model to summarize a complicated scientific article. An LLM can understand the context and the technical language used in the article and generate a coherent and simplified summary.
LLMs can handle multiple languages effectively, making them suitable for global applications. Here are a few well-known multilingual LMs.
An extension of BERT, mBERT is pretrained on the top 104 languages with the largest Wikipedia using a masked LM objective.
This model is trained on 100 languages. It extends the BERT model to include several methods for cross-lingual model training.
XLM-RoBERTa extends RoBERTa, which itself is an optimized version of BERT, and is trained on a much larger multilingual corpus covering more languages.
Part of Hugging Face’s Transformers library, MarianMT is a state-of-the-art Transformer-based model optimized for translation tasks.
This is a smaller and faster version of mBERT, achieved through a distillation process.
This is a variant of the Text-to-Text Transfer Transformer (T5) model, which is fine-tuned for translation tasks.
These models have achieved significant results in a variety of tasks, such as translation, named entity recognition, part-of-speech tagging, and sentiment analysis in multiple languages.
LLMs have shown a remarkable capability in generating human-like text. They can create contextually appropriate responses in conversations, write essays, and generate creative content such as poetry and stories. Models such as GPT-3, ChatGPT, and GPT-4 have shown good results in text generation tasks.
While the advantages are many, it’s important to note that there are also challenges and potential risks associated with the use of LLMs. They require significant computational resources to train and deploy, and there are ongoing concerns related to their potential to generate harmful or biased content, their interpretability, and their environmental impact. Researchers are actively working on ways to mitigate these issues while leveraging the powerful capabilities of these models.
Due to these reasons, companies are trying to implement and train ever-larger LMs (Figure 7.3):

Figure 7.3 – Newer LMs, their sizes, and their developers
Developing LLMs poses a unique set of challenges, including but not limited to handling massive amounts of data, requiring vast computational resources, and the risk of introducing or perpetuating bias. The following subsections explain these challenges in detail.
LLMs require enormous amounts of data for training. As the model size grows, so does the need for diverse, high-quality training data. However, collecting and curating such large datasets is a challenging task that can be time-consuming and expensive. There’s also the risk of inadvertently including sensitive or inappropriate data in the training set. To give a sense of scale, BERT was trained on 3.3 billion words from Wikipedia and BookCorpus, GPT-2 was trained on 40 GB of text data, and GPT-3 on 570 GB of text data. Table 7.2 shows the number of parameters and the size of the training data of a few recent LMs.
| Model | Parameters | Size of training data |
|---|---|---|
| GPT-3.5 | 175 B | 300 billion tokens |
| GPT-3 | 175 B | 300 billion tokens |
| PaLM | 540 B | 780 billion tokens |
| LLaMA | 65 B | 1.4 trillion tokens |
| BLOOM | 176 B | 366 billion tokens |

Table 7.2 – Number of parameters and training data of a few recent LMs
Training LLMs requires substantial computational resources. These models often have billions or even trillions of parameters and need to process vast amounts of data during training, which requires high-performance hardware (such as GPUs or TPUs) and a significant amount of time. This can be costly and could limit the accessibility of developing such models to only those who have these resources. For example, training GPT-3 took 1 million GPU hours, which cost around 4.6 million dollars (in 2020). Table 7.3 shows the computational resources and training time of a few recent LMs.
| Model | Hardware | Training time |
|---|---|---|
| PaLM | 6,144 TPU v4 | – |
| LLaMA | 2,048 80 GB A100 | 21 days |
| BLOOM | 384 80 GB A100 | 105 days |
| GPT-3 | 1,024 A100 | 34 days |
| GPT-4 | 25,000 A100 | 90–100 days |

Table 7.3 – The hardware and training time of a few recent LMs
LLMs can learn and perpetuate biases present in their training data. This could be explicit bias, such as racial or gender bias in the way language is used, or more subtle forms of bias, such as the underrepresentation of certain topics or perspectives. This issue can be challenging to address because bias in language is a deeply rooted societal issue, and it’s often not easy to even identify what might be considered bias in a given context.
It’s challenging to ensure that LLMs will perform well in all possible scenarios, particularly on inputs that differ from their training data. This includes dealing with ambiguous queries, handling out-of-distribution data, and ensuring a level of consistency in the responses. Making sure that the model is not overtrained can help to have a more robust model, but much more is needed to have a robust model.
LLMs, like most deep learning (DL) models, are often described as “black boxes.” It’s not easy to understand why they’re making a particular prediction or how they’re arriving at a conclusion. This makes debugging challenging if the model starts to produce incorrect or inappropriate outputs. Improving interpretability is an active area of research. For example, some libraries attempt to elucidate the decision-making process of an LM by employing techniques such as feature importance analysis, which involves removing some words and analyzing the change in gradients.
One such method is the input perturbation technique. In this approach, a word (or words) from the input text is perturbed or removed, and the change in the model’s output is analyzed. The rationale behind this is to understand the influence of a specific input word on the model’s output prediction. If the removal of a certain word significantly changes the model’s prediction, it can be inferred that the model deemed this word as important for its prediction.
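The following sketch illustrates the idea with a toy scoring function standing in for a real model; in practice, the score would be the LM's output probability for a given class or token:

```python
def toy_sentiment_score(tokens):
    """A stand-in for a real model's prediction: the fraction of
    tokens that appear in a small positive-word list."""
    positive = {"great", "fantastic", "good"}
    return sum(1 for t in tokens if t in positive) / max(len(tokens), 1)

sentence = "the film was fantastic".split()
baseline = toy_sentiment_score(sentence)

# Remove each word in turn and record how much the score drops.
importance = {}
for i, word in enumerate(sentence):
    perturbed = sentence[:i] + sentence[i + 1:]
    importance[word] = baseline - toy_sentiment_score(perturbed)

# Words whose removal changes the prediction most are deemed important.
ranked = sorted(importance, key=importance.get, reverse=True)
print(ranked[0])  # 'fantastic'
```

With a real LM, the same loop would re-run the model on each perturbed input, which is why perturbation-based explanations can be expensive for long texts.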
Analyzing gradient changes is another popular method. By investigating how the gradient of the output with respect to the input changes when a certain word is removed, one can gain insight into how the model’s decision-making process is influenced by that specific word.
These interpretation techniques provide a more transparent view into the complex decision-making process of LLMs, enabling researchers to better understand and improve their models. Libraries such as LIME and SHAP offer tools for model interpretation tasks, thus making the process more accessible to researchers.
The high computational resources needed for training LLMs can have significant environmental implications. The energy required for training these models can contribute to carbon emissions, which is a concern from a sustainability perspective.
Besides that, there are concerns about privacy and security in LLMs. For example, it is recommended not to share models that are trained using patients’ medical information, or not to feed sensitive information into publicly available LLMs such as ChatGPT, since it can return it to other users as the answer to their questions.
LLMs are generally neural network architectures that are trained on a large corpus of text data. The term “large” refers to the size of these models in terms of the number of parameters and the scale of training data. Here are some examples of LLMs.
Transformer models have been at the forefront of the recent wave of LLMs. They are based on the “Transformer” architecture, which uses self-attention mechanisms to weigh the relevance of different words in the input when making predictions. Transformers are a type of neural network architecture introduced in the paper Attention is All You Need by Vaswani et al. One of their significant advantages, particularly for training LLMs, is their suitability for parallel computing.
In traditional RNN models, such as LSTM and GRU, the sequence of tokens (words, subwords, or characters in the text) must be processed sequentially. That’s because each token’s representation depends not only on the token itself but also on the previous tokens in the sequence. The inherent sequential nature of these models makes it difficult to parallelize their operations, which can limit the speed and efficiency of the training process.
Transformers, in contrast, eliminate the necessity for sequential processing by using a mechanism called self-attention (or scaled dot-product attention). In the self-attention process, each token’s representation is computed as a weighted sum of all tokens in the sequence, with the weights determined by the attention mechanism. Importantly, these computations for each token are independent of the computations for other tokens, and thus they can be performed in parallel.
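The scaled dot-product attention underlying this mechanism can be sketched in a few lines of NumPy (an illustration of the computation, not a full Transformer): every row of the output comes from the same matrix products, which is what makes the per-token computations parallelizable.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # pairwise token similarities
    # Row-wise softmax turns similarities into attention weights.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))              # 4 tokens, embedding dimension 8
out, attn = scaled_dot_product_attention(X, X, X)
print(out.shape)                         # (4, 8)
```

In real self-attention, Q, K, and V are separate learned linear projections of the same token embeddings; here they are set equal for simplicity.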
This parallelization capability brings several advantages for training LLMs, as we will discuss next.
By parallelizing the computations, Transformers can process large amounts of data more quickly than RNNs. This speed can significantly reduce the training time of LLMs, which often need to process vast amounts of data.
Transformers’ parallelization makes it easier to scale up the model size and the amount of training data. This capability is crucial for developing LLMs, as these models often benefit from being trained on larger datasets and having a larger number of parameters.
Transformers can better capture long-range dependencies between tokens because they consider all tokens in the sequence simultaneously, rather than processing them one at a time. This capability is valuable in many language tasks and can improve the performance of LLMs.
Each of these models has its own strengths and weaknesses, and the best choice of model can depend on the specific task, the amount and type of available training data, and the computational resources available.
In this section, we dig deeper into the design and architecture of some of the newest LLMs at the time of writing this book.
The core of ChatGPT is a Transformer, a type of model architecture that uses self-attention mechanisms to weigh the relevance of different words in the input when making predictions. It allows the model to consider the full context of the input when generating a response.
ChatGPT is based on the GPT version of the Transformer. The GPT models are trained to predict the next word in a sequence of words, given all the previous words. They process text from left to right (unidirectional context), which makes them well-suited for text generation tasks. For instance, GPT-3, one of the versions of GPT on which ChatGPT is based, contains 175 billion parameters.
The training process for ChatGPT is done in two steps: pretraining and fine-tuning.
In this step, the model is trained on a large corpus of publicly available text from the internet. However, it’s worth noting that it does not know specifics about which documents were in its training set or have access to any specific documents or sources.
After pretraining, the base model is further trained (fine-tuned) on custom datasets created by OpenAI, which include demonstrations of correct behavior as well as comparisons to rank different responses. Some prompts are from users of the Playground and ChatGPT apps, but they are anonymized and stripped of personally identifiable information.
Part of the fine-tuning process involves RLHF, where human AI trainers provide feedback on model outputs for a range of example inputs, and this feedback is used to improve the model’s responses. RLHF is a key component of the fine-tuning process used to train ChatGPT. It’s a technique for refining the performance of the model by learning from feedback provided by human evaluators. Here, we first explain the general idea of RLHF, and in the next section, we explain it step by step.
The first step in RLHF is to collect human feedback. For ChatGPT, this often involves having human AI trainers participate in conversations where they play both sides (the user and the AI assistant). The trainers also have access to model-written suggestions to help them compose responses. This dialogue, in which AI trainers are essentially having a conversation with themselves, is added to the dataset for fine-tuning.
In addition to the dialogues, comparison data is created where multiple model responses are ranked by quality. This is done by taking a conversation turn, generating several different completions (responses), and having human evaluators rank them. The evaluators don’t just rank the responses on factual correctness but also on how useful and safe they judge the responses to be.
The model is then fine-tuned using Proximal Policy Optimization (PPO), a reinforcement learning algorithm. PPO attempts to improve the model's responses based on human feedback, making small adjustments to the model's parameters to increase the likelihood of better-rated responses and decrease the likelihood of worse-rated responses.
RLHF is an iterative process. The procedure of collecting human feedback, creating comparison data, and fine-tuning the model using PPO is repeated multiple times to incrementally improve the model. Next, we will explain in more detail how PPO works.
PPO is a reinforcement learning algorithm used to optimize the policy π of an agent. The policy defines how the agent selects actions based on its current state. PPO aims to optimize this policy to maximize the expected cumulative rewards.
Before diving into PPO, it's important to define the reward model. In the context of reinforcement learning, the reward model is a function R(s, a) that assigns a reward value to every state-action pair (s, a). The goal of the agent is to learn a policy π that maximizes the expected sum of these rewards.
Mathematically, the objective of reinforcement learning can be defined as follows:

J(π) = E_π[ Σ_t R(s_t, a_t) ]
In this formula, Eπ[.] is the expectation over trajectories (sequences of state-action pairs) generated by following policy π, s_t is the state at time t, a_t is the action taken at time t, and R(s_t, a_t) is the reward received at time t.
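To make the objective concrete, here is a toy Monte Carlo sketch of estimating the expected cumulative reward by averaging returns over sampled trajectories; the reward function and the trajectories are illustrative assumptions, not anything from a real RLHF setup:

```python
def trajectory_return(trajectory, reward_fn):
    """Sum of rewards R(s_t, a_t) along one trajectory of (state, action) pairs."""
    return sum(reward_fn(s, a) for s, a in trajectory)

def estimate_objective(trajectories, reward_fn):
    """Monte Carlo estimate of the expected cumulative reward under the policy."""
    returns = [trajectory_return(tau, reward_fn) for tau in trajectories]
    return sum(returns) / len(returns)

# Toy reward: +1 when the action "matches" the state, else 0.
reward = lambda s, a: 1.0 if s == a else 0.0

trajectories = [
    [(0, 0), (1, 1), (2, 0)],  # return 2.0
    [(0, 1), (1, 1), (2, 2)],  # return 2.0
]
print(estimate_objective(trajectories, reward))  # 2.0
```

In practice, of course, the expectation is estimated over trajectories sampled from the current policy rather than a fixed list.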
PPO modifies this objective to encourage exploration of the policy space while preventing too drastic changes in the policy at each update. This is done by introducing a ratio, r_t(θ), which represents the ratio of the probabilities of the current policy π_θ to the old policy π_θ_old:

r_t(θ) = π_θ(a_t | s_t) / π_θ_old(a_t | s_t)
The objective of PPO is then defined as follows:

J_PPO(π) = E_t[ min( r_t(θ) A_t, clip(r_t(θ), 1 − ε, 1 + ε) A_t ) ]
Here, A_t is the advantage function that measures how much better taking action a_t is compared to the average action at state s_t, and clip(r_t(θ), 1 − ε, 1 + ε) is a clipped version of r_t(θ) that discourages overly large policy updates.
The algorithm then optimizes this objective using stochastic gradient ascent, adjusting the policy parameters θ to increase J_PPO(π).
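As an illustration, the clipped surrogate objective can be sketched in a few lines of plain Python; the sample ratios, advantages, and ε = 0.2 are illustrative values only:

```python
def ppo_clipped_objective(ratios, advantages, eps=0.2):
    """Mean of min(r_t * A_t, clip(r_t, 1-eps, 1+eps) * A_t) over timesteps."""
    total = 0.0
    for r, a in zip(ratios, advantages):
        clipped = max(1.0 - eps, min(r, 1.0 + eps))  # clip(r, 1-eps, 1+eps)
        total += min(r * a, clipped * a)
    return total / len(ratios)

# A step whose ratio moved far from 1 gets its contribution capped:
# for ratio 1.5 with advantage +1, the clipped term 1.2 wins the min;
# for ratio 0.5 with advantage -1, the clipped term -0.8 wins.
objective = ppo_clipped_objective([1.5, 0.5], [1.0, -1.0])
```

In a real implementation, this objective would be maximized with stochastic gradient ascent over θ, with the ratios recomputed each epoch from the policy network.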
In the context of ChatGPT and RLHF, the states correspond to the conversation histories, the actions correspond to the model-generated messages, and the rewards correspond to the human feedback on these messages. PPO is used to adjust the model parameters to improve the quality of the generated messages as judged by the human feedback.
The human rankings are used to create a reward model, which quantifies how good each response is. The reward model is a function that takes in a state and an action (in this case, a conversation context and a model-generated message), and outputs a scalar reward. During training, the model tries to maximize its expected cumulative reward.
The goal of RLHF is to align the model’s behavior with human values and to improve its ability to generate useful and safe responses. By learning from human feedback, ChatGPT can adapt to a wider range of conversational contexts and provide more appropriate and helpful responses. It’s worth noting that despite these efforts, the system might still make mistakes, and handling these errors and improving the RLHF process is an area of ongoing research.
When generating a response, ChatGPT takes as input a conversation history, which includes previous messages in the conversation along with the most recent user message, and produces a model-generated message as output. The conversation history is tokenized and fed into the model, which generates a sequence of tokens in response, and these tokens are then detokenized to form the final output text.
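The tokenize-generate-detokenize loop can be illustrated with a toy sketch; the whitespace "tokenizer" and the hypothetical next-token table standing in for the model are assumptions for illustration, not ChatGPT's actual components:

```python
def tokenize(text):
    # Toy stand-in for a real subword tokenizer.
    return text.split()

def detokenize(tokens):
    return " ".join(tokens)

# A hypothetical next-token table playing the role of the model.
NEXT = {"Hello": "there,", "there,": "how", "how": "can", "can": "I", "I": "help?"}

def generate(history, max_new_tokens=5):
    tokens = tokenize(history)
    for _ in range(max_new_tokens):
        nxt = NEXT.get(tokens[-1])  # greedy next-token choice
        if nxt is None:
            break
        tokens.append(nxt)
    return detokenize(tokens)

print(generate("Hello"))  # Hello there, how can I help?
```

A real model conditions on the entire token sequence rather than just the last token, and samples from a probability distribution instead of a lookup table.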
OpenAI has also implemented some system-level controls to mitigate harmful or untruthful outputs from ChatGPT. This includes a Moderation API that warns or blocks certain types of unsafe content.
Since RLHF is an important part of ChatGPT and several other state-of-the-art (SOTA) models, understanding it better is useful to you. In recent years, LMs have demonstrated remarkable abilities, creating varied and compelling text based on human-generated prompts. Nonetheless, it's challenging to precisely define what constitutes "good" text as it is inherently subjective and depends on the context. For instance, while crafting stories demands creativity, informative pieces require accuracy, and code snippets need to be executable.
Defining a loss function to encapsulate these attributes seems virtually impossible, hence most LMs are trained using a basic next-token prediction loss, such as cross-entropy. To overcome the limitations of the loss function, researchers have developed metrics that better align with human preferences, such as BLEU or ROUGE. The BLEU score, or Bilingual Evaluation Understudy, is a metric used to measure how well machine-translated text compares to a set of reference translations. Although these metrics are more effective at assessing performance, they are inherently limited as they merely compare the generated text to references using basic rules.
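As a concrete example, the core of BLEU, clipped (modified) n-gram precision, can be sketched for unigrams; real BLEU also combines higher-order n-grams and a brevity penalty:

```python
from collections import Counter

def modified_unigram_precision(candidate, references):
    """Candidate token counts are clipped by the max count in any reference."""
    cand = Counter(candidate.split())
    clipped = 0
    for tok, cnt in cand.items():
        max_ref = max(Counter(ref.split())[tok] for ref in references)
        clipped += min(cnt, max_ref)
    return clipped / sum(cand.values())

# Classic degenerate candidate: repeating a common word scores poorly once clipped.
print(modified_unigram_precision("the the the the", ["the cat is on the mat"]))  # 0.5
```

The clipping is exactly what stops a model from gaming the metric by repeating high-frequency reference words.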
Wouldn’t it be transformative if we could use human feedback on generated text as a performance measure, or even better, as a loss to optimize the model? This is the concept behind RLHF – leveraging reinforcement learning techniques to directly optimize an LM using human feedback. RLHF has begun to enable LMs to align a model trained on a general text corpus with intricate human values.
One of the recent successful applications of RLHF has been in the development of ChatGPT.
The concept of RLHF presents a formidable challenge due to its multifaceted model training process and various deployment phases. Here, we’ll dissect the training procedure into its three essential components:
We’ll begin by examining the pretraining phase for LMs.
As a foundation, RLHF utilizes an LM that’s already been pretrained using traditional pretraining objectives, which means that we create the tokenizer based on our training data, design model architecture, and then pretrain the model using the training data. For its initial well-received RLHF model, InstructGPT, OpenAI employed a smaller version of GPT-3. On the other hand, Anthropic used transformer models ranging from 10 million to 52 billion parameters trained for this task, and DeepMind utilized its 280 billion parameter model, Gopher.
This preliminary model may be further refined on extra text or particular conditions, although it's not always necessary. As an example, OpenAI chose to refine its model using human-generated text identified as "preferable." This dataset is used to further fine-tune the model with RLHF, distilling the original LM based on contextual hints from humans.
Generally speaking, there’s no definitive answer to the question of “which model” serves as the best launching point for RLHF. The array of options available for RLHF training has not been extensively explored.
Moving on, once an LM is in place, it’s necessary to generate data to train a reward model. This step is crucial for integrating human preferences into the system.
In the newly proposed method, a reward model (RM), also known as a preference model, is used. The main idea here is to take in text and return a scalar reward that reflects human preferences. This approach can be implemented in two ways. First, implement an end-to-end LLM that gives us the preferred output directly; this can be done by fine-tuning an LLM or training an LLM from scratch. Second, have an extra component that ranks different outputs of the LLM and returns the best one.
The dataset used for training the RM is a set of prompt-generation pairs. Prompts are sampled from a predetermined dataset (Anthropic’s data). These prompts undergo processing by the initial LM to generate fresh text.
Human annotators rank the text outputs generated by the LM. It might seem intuitive to have humans directly assign a scalar score to each text piece to generate a reward model, but it proves challenging in reality. Varied human values render these scores unstandardized and unreliable. Consequently, rankings are employed to compare multiple model outputs, thereby creating a substantially better regularized dataset.
There are several strategies for text ranking. One successful approach involves users comparing the text produced by two LMs given the same prompt. By evaluating model outputs in direct comparison, an Elo rating system, which we will soon describe, can generate a ranking of models and outputs relative to each other. These varying ranking methods are then normalized into a scalar reward signal for training. The Elo rating system, originally developed for chess, is also applicable to RLHF for LMs.
In the context of LMs, each model or variant (e.g., models at different stages of training) can be seen as a “player.” Its Elo rating reflects how well it performs in terms of generating human-preferred outputs.
The fundamental mechanics of the Elo rating system remain the same. Here’s how it can be adapted for RLHF in LMs:
The Elo ratings are updated in this way after each evaluation. Over time, they provide an ongoing, dynamic ranking of the models based on human preferences. This is useful for tracking progress over the course of training and for comparing different models or model variants.
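A minimal sketch of the Elo update applied to a pairwise preference between two model variants; the K-factor of 32 and the starting ratings are conventional illustrative choices, not values from the text:

```python
def expected_score(rating_a, rating_b):
    """Probability that A's output is preferred, from the rating gap."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def elo_update(rating_a, rating_b, score_a, k=32):
    """score_a is 1.0 if model A's output was preferred, 0.0 if B's, 0.5 for a tie."""
    ea = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - ea)
    new_b = rating_b + k * ((1 - score_a) - (1 - ea))
    return new_a, new_b

# Two equally rated variants; the annotator prefers A's response.
a, b = elo_update(1200, 1200, 1.0)  # a -> 1216.0, b -> 1184.0
```

Repeating this update over many pairwise human judgments yields the dynamic ranking described above, which can then be normalized into a scalar reward signal.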
Successful RLHF systems have employed reward LMs of diverse sizes relative to the text-generation model. For example, OpenAI used a 175B LM with a 6B reward model, Anthropic utilized LMs and reward models ranging from 10B to 52B, and DeepMind employed 70B Chinchilla models for both the LM and the reward model. This is because a preference model must have a capacity to understand text comparable to the capacity a model needs to generate it. At this juncture in the RLHF process, we possess an initial LM capable of text generation and a preference model that assigns a score to any text based on human perception. Next, we apply reinforcement learning to optimize the original LM with respect to the reward model.
Figure 7.5 – The reward model for reinforcement learning
For a considerable period, the prospect of training an LM using reinforcement learning was considered unattainable due to both technical and algorithmic challenges. However, several organizations have achieved fine-tuning of some or all parameters of a replica of the initial LM with a policy-gradient reinforcement learning algorithm – namely, PPO. Some parameters of the LM are kept static because fine-tuning an entire model with 10B or 100B+ parameters is prohibitively expensive (for further details, refer to Low-Rank Adaptation (LoRA) for LMs or DeepMind's Sparrow LM). PPO, an established method for some time now, has abundant available guides explaining its functioning. This maturity made it an attractive choice for scaling up to the novel application of distributed training for RLHF. It appears that significant strides in RLHF have been made by determining how to update such a colossal model with a known algorithm (more on that later).
We can articulate this fine-tuning task as a reinforcement learning problem. Initially, the policy is an LM that accepts a prompt and produces a sequence of text (or merely probability distributions over text). The action space of this policy is all the tokens in the LM's vocabulary (typically around 50K tokens), and the observation space is the distribution of possible input token sequences, which is also notably large in light of reinforcement learning's prior uses (the dimension is approximately the vocabulary size raised to the power of the length of the input token sequence). The reward function melds the preference model with a constraint on policy shift.
The reward function is the juncture where the system integrates all the models discussed into a single RLHF process. Given a prompt, x, from the dataset, the text, y, is created by the current iteration of the fine-tuned policy. This text, coupled with the original prompt, is passed to the preference model, which returns a scalar measure of "preferability," r_θ(x, y).
Additionally, per-token probability distributions from the reinforcement learning policy are contrasted with those from the initial model to compute a penalty on their difference. In several papers from OpenAI, Anthropic, and DeepMind, this penalty has been constructed as a scaled version of the Kullback–Leibler (KL) divergence between these sequences of distributions over tokens, r_KL. The KL divergence term penalizes the reinforcement learning policy for veering significantly from the initial pretrained model with each training batch, ensuring the production of reasonably coherent text snippets.
Without this penalty, the optimization might start generating gibberish text that somehow deceives the reward model into granting a high reward. In practical terms, the KL divergence is approximated via sampling from both distributions. The final reward transmitted to the reinforcement learning update rule is as follows:

r = r_θ − λ · r_KL
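The penalized reward described above, the preference score minus a scaled KL term, can be sketched as follows; λ and the toy per-token distributions are illustrative assumptions:

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same token set."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def penalized_reward(preference_score, policy_dists, ref_dists, lam=0.02):
    # Sum per-token KL(policy || reference) over the generated sequence.
    r_kl = sum(kl_divergence(p, q) for p, q in zip(policy_dists, ref_dists))
    return preference_score - lam * r_kl

# One generated token; the policy has drifted slightly from the frozen model.
policy = [[0.7, 0.2, 0.1]]
reference = [[0.6, 0.3, 0.1]]
print(penalized_reward(2.5, policy, reference))
```

If the policy's distributions match the frozen model's exactly, the KL term vanishes and the reward is just the preference score; the further the policy drifts, the larger the deduction.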
Additional terms have been incorporated into the reward function by some RLHF systems. For instance, OpenAI’s InstructGPT successfully tried the blending of additional pretraining gradients (from the human annotation set) into the update rule for PPO. It is anticipated that as RLHF continues to be studied, the formulation of this reward function will continue to evolve.
Finally, the update rule is the parameter update from PPO that optimizes the reward metrics in the current data batch (PPO is on-policy, meaning the parameters are only updated with the current batch of prompt-generation pairs). PPO is a trust region optimization algorithm that employs constraints on the gradient to ensure the update step does not destabilize the learning process. DeepMind utilized a similar reward setup for Gopher but employed synchronous advantage actor-critic (A2C) to optimize the gradients.
Figure 7.6 – Fine-tuning the model using reinforcement learning
The preceding diagram may suggest that both models produce different responses for the same prompt, but what actually occurs is that the reinforcement learning policy generates text, which is then supplied to the initial model to derive its relative probabilities for the KL penalty.
Optionally, RLHF can advance from this stage by cyclically updating both the reward model and the policy. As the reinforcement learning policy evolves, users can maintain the ranking of these outputs against the model's previous versions. However, most papers haven't yet addressed the implementation of this operation, since the mode of deployment required to collect this type of data only works for dialogue agents with access to an active user base. Anthropic refers to this alternative as iterated online RLHF (as termed in the original paper), where iterations of the policy are incorporated into the Elo ranking system across models. This brings about the complex dynamics of the policy and reward model evolving together, representing a complex and unresolved research question. In the next section, we will explain some well-known open-source tools for RLHF.
At the time of writing this book, we know very little about the design of the GPT-4 model. As OpenAI has been slow to reveal details, it is assumed that GPT-4 is not a single model but a combination of eight 220-billion-parameter models, an assumption echoed by key figures in the AI community. This assumption suggests OpenAI used a "mixture of experts" strategy, an ML design tactic that predates LLMs, to create the model. However, while we, the authors, support this assumption, it has not been officially confirmed by OpenAI.
Despite the speculation, GPT-4’s impressive performance is undeniable, regardless of its internal structure. Its capabilities in writing and coding tasks are remarkable, and the specifics of whether it’s one model or eight bundled together does not change its impact.
A common narrative suggests that OpenAI has managed expectations around GPT-4 deftly, focusing on its power and opting not to disclose specifications due to competitive pressures. The secrecy surrounding GPT-4 has led many to believe it to be a scientific marvel.
Meta has publicly launched LLaMA, a high-performing LLM aimed at aiding researchers in AI. This move allows individuals with limited access to extensive infrastructure to examine these models, thus broadening access in this rapidly evolving field.
LLaMA models are attractive because they require significantly less computational power and resources, allowing for the exploration of new approaches and use cases. Available in several sizes, these models are designed to be fine-tuned for various tasks and have been developed with responsible AI practices.
LLMs, despite their advancements, have limited research accessibility due to the resources required to train and run them. Smaller models, such as LLaMA, trained on more tokens, are simpler to retrain and adjust for specific use cases.
Similar to other models, LLaMA takes a sequence of words as input to predict the next word and generate text. Despite its capabilities, LLaMA shares the same challenges as other models regarding bias, toxic comments, and hallucinations. By sharing LLaMA’s code, Meta enables researchers to test new ways of addressing these issues in LLMs.
Meta emphasizes the need for cooperation across the AI community to establish guidelines around responsible AI and LLMs. They anticipate that LLaMA will facilitate new learning and development in the field.
PaLM is a 540-billion-parameter, densely activated Transformer LM that was trained on 6,144 TPU v4 chips using Pathways, a new ML system that enables highly efficient training across multiple TPU pods.
PaLM has been shown to achieve breakthrough performance on a variety of natural language tasks, including the following:
The BIG-bench benchmark is worth expanding on as it serves as a recognized collection of benchmarks to measure against. The BIG-bench is an extensive assessment mechanism specifically designed for large-scale LMs. It is a broad-based, community-focused benchmark that presents a diversity of tasks to evaluate a model’s performance in different disciplines and its competence in natural language comprehension, problem solving, and reasoning. With a total of 204 tasks from 450 contributors across 132 institutions, BIG-bench covers an eclectic mix of subjects including linguistics, childhood development, mathematics, common-sense reasoning, biology, physics, software development, and even social bias. It concentrates on challenges believed to be currently beyond the reach of existing LMs. The primary goal of BIG-bench extends beyond mere imitation or Turing test-style evaluations, aiming instead for a deeper, more nuanced appraisal of the abilities and constraints of these large models. This initiative is founded on the conviction that an open, collaborative approach to evaluation paves the way for a more comprehensive understanding of these LMs and their potential societal ramifications.
PaLM 540B excels beyond the fine-tuned state-of-the-art across various multi-step reasoning tasks and surpasses average human performance on the BIG-bench benchmark. Many BIG-bench tasks exhibit significant leaps in performance as PaLM scales to its largest size, demonstrating discontinuous improvements from the model scale. PaLM also has strong capabilities in multilingual tasks and source code generation. For example, PaLM can translate between 50 languages, and it can generate code in a variety of programming languages.
The authors of the PaLM paper also discuss the ethical considerations related to LLMs, and they discuss potential mitigation strategies. For example, they suggest that it is important to be aware of the potential for bias in LLMs and that it is important to develop techniques for detecting and mitigating bias.
PaLM employs the conventional Transformer model architecture in a decoder-exclusive setup, which allows each timestep to attend only to itself and preceding timesteps. Several modifications were applied to this setup, including the following:
The standard structure is given by the following:

y = x + MLP(LayerNorm(x + Attention(LayerNorm(x))))

The parallel structure is instead the following:

y = x + MLP(LayerNorm(x)) + Attention(LayerNorm(x))

This leads to roughly 15% quicker training speed at larger scales due to the fusion of the MLP and attention input matrix multiplications.
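The data flow of the standard versus parallel block formulations can be made concrete with a toy sketch; layer_norm, mlp, and attention here are placeholder functions on plain numbers, chosen only to make the wiring visible, not real sublayers:

```python
def layer_norm(x):
    return x          # placeholder normalization (identity)

def mlp(x):
    return 2 * x      # placeholder feed-forward sublayer

def attention(x):
    return x + 1      # placeholder self-attention sublayer

def standard_block(x):
    # y = x + MLP(LayerNorm(x + Attention(LayerNorm(x)))): sequential sublayers
    return x + mlp(layer_norm(x + attention(layer_norm(x))))

def parallel_block(x):
    # y = x + MLP(LayerNorm(x)) + Attention(LayerNorm(x)): one LayerNorm feeds
    # both sublayers, letting their input matrix multiplications be fused
    return x + mlp(layer_norm(x)) + attention(layer_norm(x))

print(standard_block(1), parallel_block(1))  # 7 5
```

The differing outputs show the point: in the parallel form, the MLP no longer sees the attention output, which changes the computation slightly but enables the fusion that speeds up training.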
Overall, PaLM is a powerful LM that has the potential to be used for a wide variety of applications. It is still under development, but it has already demonstrated the ability to achieve breakthrough performance on a number of tasks.
OpenAI released the first open-source code for performing RLHF in 2019. They implemented this approach to improve GPT-2 for different use cases such as summarization. Based on human feedback, the model is optimized to produce outputs that resemble human-preferred ones, such as copying parts of the input text when summarizing. More information about this project can be found at the following link: https://openai.com/research/fine-tuning-gpt-2.
The code is also available at the following link: https://github.com/openai/lm-human-preferences.
Transformers Reinforcement Learning (TRL) is a tool crafted for fine-tuning pretrained LMs using PPO within the Hugging Face ecosystem. TRLX, an enhanced fork developed by CarperAI, is capable of handling larger models for both online and offline training. Currently, TRLX is equipped with a production-ready API supporting RLHF with PPO and implicit language Q-learning (ILQL) for deploying LLMs of up to 33 billion parameters. Future versions of TRLX aim to accommodate LMs of up to 200 billion parameters, making it ideal for ML engineers working at such scales.
Another good library is Reinforcement Learning for Language Models (RL4LMs). The RL4LMs project addresses the challenge of training LLMs to align with human preference metrics. It recognizes that many NLP tasks can be seen as sequence learning problems, but their application is limited due to issues such as reinforcement learning training instability, high variance in automated NLP metrics, and reward hacking. The project offers solutions by doing the following:
The code for this project can be found at the following link: https://github.com/allenai/RL4LMs.
In this chapter, we’ve delved into the dynamic and complex world of state-of-the-art LLMs. We’ve discussed their remarkable generalization capabilities, making them versatile tools for a wide range of tasks. We also highlighted the crucial aspect of understanding complex contexts, where these models excel by grasping the nuances of language and the intricacies of various subject matters.
Additionally, we explored the paradigm of RLHF and how it is being employed to enhance LMs. RLHF leverages scalar feedback to improve LMs by mimicking human judgments, thereby helping to mitigate some of the common pitfalls encountered in NLP tasks.
We discussed technical requirements for working with these models, emphasizing the need for foundational knowledge in areas such as Transformers, reinforcement learning, and coding skills.
This chapter also touched on some prominent LMs such as GPT-4 and LLaMA, discussing their architecture, methods, and performance. We highlighted the strategies some libraries employ to interpret LM predictions, such as the removal of certain words and analyzing gradient changes.
To sum up, this chapter offers a comprehensive overview of the current state of LLMs, exploring their capabilities, challenges, the methods used to refine them, and the evolving tools and measures for their evaluation and interpretation.
In this dynamic era of Artificial Intelligence (AI) and Machine Learning (ML), understanding the vast assortment of available resources and learning how to utilize them effectively is vital. Large Language Models (LLMs) such as GPT-4 have revolutionized the field of Natural Language Processing (NLP) by showcasing unprecedented performance in diverse tasks, from content generation to complex problem-solving. Their immense potential extends not only to understanding and generating human-like text but also to bridging the gap between machines and humans, in terms of communication and task automation. Embracing the practical applications of LLMs can empower businesses, researchers, and developers to create more intuitive, intelligent, and efficient systems that cater to a wide range of requirements. This chapter offers a guide to setting up access to LLMs, walking you through using them and building pipelines with them.
Our journey begins by delving into closed source models that utilize Application Programming Interfaces (APIs), taking OpenAI’s API as a quintessential example. We will walk you through a practical scenario, illustrating how you can interact with this API using an API key within your Python code, demonstrating the potential applications of such models in real-world contexts.
As we advance, we will shift our focus to the realm of open source tools, giving you a rundown of widely employed open source models that can be manipulated via Python. We aim to provide a grasp of the power and versatility these models provide, emphasizing the community-driven benefits of open source development.
Subsequently, we will introduce you to retrieval-augmented generation and, specifically, LangChain, a robust tool specifically engineered for interaction with LLMs. LangChain is essential for the practical application of LLMs because it provides a unified and abstracted interface to them, as well as a suite of tools and modules that simplify the development and deployment of LLM-powered applications. We’ll guide you through the foundational concept of LangChain, highlighting its distinctive methodology to circumvent the inherent challenges posed by LLMs.
The cornerstone of this methodology is the transformation of data into embeddings. We will shed light on the pivotal role that Language Models (LMs) and LLMs play in this transformation, demonstrating how they are engaged in creating these embeddings. Following this, we will discuss the process of establishing a local vector database, giving you a brief overview of vector databases and their crucial role in managing and retrieving these embeddings.
Then, we will address the configuration of an LLM for prompting, which could potentially be the same LLM used for the embedding process. We will take you through the stepwise setup procedure, detailing the advantages and potential applications of this strategy.
In the penultimate segment, we will touch upon the topic of deploying LLMs to the cloud. The scalability and cost-effectiveness of cloud services have led to an increased adoption of hosting AI models. We will provide an overview of the leading cloud service providers, including Microsoft Azure, Amazon Web Services (AWS), and Google Cloud Platform (GCP), giving you insights into their service offerings and how they can be harnessed for LLM deployment.
As we embark on this exploration of LLMs, it’s crucial to acknowledge the continuously evolving data landscape that these models operate within. The dynamic nature of data – its growth in volume, diversity, and complexity – necessitates a forward-looking approach to how we develop, deploy, and maintain LLMs. In the subsequent chapters, particularly Chapter 10, we will delve deeper into the strategic implications of these evolving data landscapes, preparing you to navigate the challenges and opportunities they present. This foundational understanding will not only enhance your immediate work with LLMs but also ensure your projects remain resilient and relevant in the face of rapid technological and data-driven changes.
Let’s go through the main topics covered in the chapter:
For this chapter, the following will be necessary:
Now that we’ve grasped the transformative potential of LLMs and the variety of tools available, let’s delve deeper and explore how to effectively set up LLM applications using APIs.
When looking to employ models in general and LLMs in particular, there are various design choices and trade-offs. One key choice is whether to host a model locally in your local environment or to employ it remotely, accessing it via a communication channel. Local development environments would be wherever your code runs, whether that’s your personal computer, your on-premises server, your cloud environment, and so on. The choice you make will impact many aspects, such as cost, information security, maintenance needs, network overload, and inference speed.
In this section, we will introduce a quick and simple approach to employing an LLM remotely via an API. This approach is quick and simple as it rids us of the need to allocate unusual computation resources to host the LLM locally. An LLM typically requires amounts of memory and computation resources that aren’t common in personal environments.
Before diving into implementation, we need to select a suitable LLM provider that aligns with our project requirements. OpenAI, for example, offers several versions of the GPT-3.5 and GPT-4 models with comprehensive API documentation.
To gain access to OpenAI’s LLM API, we need to create an account on their website. This process involves registration, account verification, and obtaining API credentials.
OpenAI’s website provides guidance for these common actions, and you will be able to get set up quickly.
Once registered, we should familiarize ourselves with OpenAI’s API documentation. This documentation will guide us through the various endpoints, methods, and parameters available to interact with the LLMs.
The first hands-on experience we will take on will be employing OpenAI’s LLMs via Python. We have put together a notebook that presents the simple steps of employing OpenAI’s GPT model via an API. Refer to the Ch8_Setting_Up_Close_Source_and_Open_Source_LLMs.ipynb notebook. This notebook, called Setting Up Close Source and Open Source LLMs, will be utilized in the current section about OpenAI’s API, and also in the subsequent section about setting up local LLMs.
Let’s walk through the code:
!pip install --upgrade openai
openai.api_key = "<your key>"
As the foundation is set for connecting to LLMs through APIs, it’s valuable to turn our attention to an equally important aspect – prompt engineering and priming, the art of effectively communicating with these models.
Let us pause and provide some context before returning to discuss the next part of the code.
Prompt engineering is a technique used in NLP to design effective prompts or instructions when interacting with LLMs. It involves carefully crafting the input given to a model to elicit the desired output. By providing specific cues, context, or constraints in the prompts, prompt engineering aims to guide the model's behavior and encourage the generation of more accurate, relevant, or targeted responses. The process often involves iterative refinement, experimentation, and understanding the model's strengths and limitations to optimize the prompt for improved performance in various tasks, such as question answering, summarization, or conversation generation. Effective prompt engineering plays a vital role in harnessing the capabilities of LMs and shaping their output to meet specific user requirements.
Let’s review one of the most impactful tools in prompt engineering, priming. Priming GPT via an API involves providing initial context to the model before generating a response. The priming step helps set the direction and style of the generated content. By giving the model relevant information or examples related to the desired output, we can guide its understanding and encourage more focused and coherent responses. Priming can be done by including specific instructions, context, or even partial sentences that align with the desired outcome. Effective priming enhances the model’s ability to generate responses that better match the user’s intent or specific requirements.
Priming is done by presenting GPT with several types of messages:
For example, observe this priming code:
response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system",
         "content": "You are a helpful assistant. You provide short answers and you format them in Markdown syntax"},
        {"role": "user",
         "content": "How do I import the Python library pandas?"},
        {"role": "assistant",
         "content": "This is how you would import pandas: \n```\nimport pandas as pd\n```"},
        {"role": "user",
         "content": "How do I import the python library numpy?"}
    ])
text = response.choices[0].message.content.strip()
print(text)
To import numpy, you can use the following syntax:
```python
import numpy as np
```
You can see that we prime the model to provide concise answers in a Markdown format. The example that is used to teach the model is in the form of a question and an answer. The question is posed via a user prompt, and the potential answer is supplied via an assistant prompt. We then provide the model with another user prompt; this one is the actual prompt we'd like the model to address for us, and as shown in the output, it gets it right.
By looking at OpenAI’s documentation about prompt engineering, you’ll find that there are additional types of prompts to prime the GPT models with.
Going back to our notebook and code, in this section, we leverage GPT-3.5 Turbo. We prime it in the simplest manner, only giving it a system prompt to provide directions in order to showcase how additional functionality could stem from the system prompt. We tell the model to finish a response by alerting us about typos in the prompt and correcting them.
We then provide our desired prompt in the user prompt section, and we insert a few typos into it. Run that code and give it a shot.
At this stage, we send our prompts to the model.
The following simple example code is run once in the Setting Up Close Source and Open Source LLMs notebook. You can wrap it in a function and call it repeatedly in your own code.
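One possible way to wrap the call is sketched below. This is our own illustrative helper, not the notebook's exact code: the `client` argument is assumed to be an initialized OpenAI client, and the retry logic mirrors the `attempts`/`max_attempts` pattern shown further down in this section.

```python
import time

def ask_llm(client, user_prompt, model="gpt-3.5-turbo",
            max_attempts=3, base_delay=1.0):
    """Send a prompt to the chat completions endpoint, retrying on failure."""
    attempts = 0
    while True:
        try:
            response = client.chat.completions.create(
                model=model,
                messages=[{"role": "user", "content": user_prompt}],
            )
            # The assistant's reply is the content of the first choice
            return response.choices[0].message.content.strip()
        except Exception:
            attempts += 1
            if attempts >= max_attempts:
                raise  # give up and surface the last error
            time.sleep(base_delay * attempts)  # simple linear backoff
```

With a helper like this, repeated prompts in your own code become one-liners, and transient API errors are absorbed by the retry loop rather than crashing the program.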
Some aspects worth noticing are as follows:
print(f"Prompt: {user_prompt_oai}\n\n{openai_model}'s Response: \n{response_oai}")
except Exception as output:
    attempts += 1
    if attempts >= max_attempts:
        […]
The result of the preceding code is demonstrated as follows:
Prompt: If neuroscience could extract the last thoughts a person had before they dyed, how would the world be diferent?

gpt-3.5-turbo's Response:
If neuroscience could extract the last thoughts a person had before they died, it would have profound implications for various aspects of society. This ability could potentially revolutionize fields such as psychology, criminology, and end-of-life care. Understanding a person's final thoughts could provide valuable insights into their state of mind, emotional well-being, and potentially help unravel mysteries surrounding their death. It could also offer comfort to loved ones by providing a glimpse into the innermost thoughts of the deceased. However, such technology would raise significant ethical concerns regarding privacy, consent, and the potential misuse of this information. Overall, the world would be both fascinated and apprehensive about the implications of this groundbreaking capability.

Typos in the prompt:
1. "dyed" should be "died"
2. "diferent" should be "different"

Corrections:
If neuroscience could extract the last thoughts a person had before they died, how would the world be different?
The model provided us with a legitimate, concise response. It then notified us about the typos, which are perfectly in line with the system prompt we provided it with.
That was an example showcasing the employment of a remote, off-premises, closed source LLM. While leveraging the power of paid APIs such as OpenAI offers convenience and cutting-edge performance, there’s also immense potential in tapping into free open source LLMs. Let’s explore these cost-effective alternatives next.
Now, we shall touch on the complementary approach to a closed source implementation, that is, an open source, local implementation.
We will see how you can achieve a similar functional outcome to the one we reviewed in the previous section, without having to register for an account, pay, or share prompts that contain possibly sensitive information with a third-party vendor, such as OpenAI.
When selecting between open source LLMs, such as LLaMA and GPT-J, and closed source, API-based models such as OpenAI’s GPT, several critical factors must be considered.
Firstly, cost is a major factor. Open source LLMs often have no licensing fees, but they require significant computational resources for training and inference, which can be expensive. Closed source models, while potentially carrying a subscription or pay-per-use fee, eliminate the need for substantial hardware investments.
Processing speed and maintenance are closely linked to computational resources. Open source LLMs, if deployed on powerful enough systems, can offer high processing speeds but require ongoing maintenance and updates by the implementing team. In contrast, closed source models managed by the provider ensure continual maintenance and model updates, often with better efficiency and reduced downtime, but processing speed can be dependent on the provider’s infrastructure and network latency.
Regarding model updates, open source models offer more control but require a proactive approach to incorporate the latest research and improvements. Closed source models, however, are regularly updated by the provider, ensuring access to the latest advancements without additional effort from the user.
Security and privacy are paramount in both scenarios. Open source models can be more secure, as they can be run on private servers, ensuring data privacy. However, they demand robust in-house security protocols. Closed source models, managed by external providers, often come with built-in security measures but pose potential privacy risks, due to data handling by third parties.
Overall, the choice between open source and closed source LLMs hinges on the trade-off between cost, control, and convenience, with each option presenting its own set of advantages and challenges.
With that in mind, let’s revisit Hugging Face, the company that put together the largest and most approachable hub for free LMs. In the following example, we will leverage Hugging Face’s easy and free library: transformers.
When looking to choose an LLM for our task, we recommend referring to Hugging Face’s Models online page. They offer an enormous amount of Python-based, open source LLMs. Every model has a page dedicated to it, where you can find information about it, including the syntax needed to employ that model via Python code in your personal environment.
It should be noted that in order to implement a model locally, you must have an internet connection from the machine that runs the Python code. However, as this requirement may become a bottleneck in some cases – for instance, when the development environment is restricted by a company’s intranet or has limited internet access due to firewall restrictions – there are alternative approaches. Our recommended approach is to clone the model repository from Hugging Face’s domain. That is a less trivial and less-used approach. Hugging Face provides the necessary cloning commands on each model’s web page.
When looking to choose a model, there may be several factors that come into play. Depending on your intentions, you may care about configuration speed, processing speed, storage space, computation resources, legal usage restrictions, and so on. Another factor worth noting is the popularity of a model. It attests to how frequently that model is chosen by other developers in the community. For instance, if you look for LMs that are labeled for zero-shot classification, you will find a very large collection of available models. But if you then narrow the search further, so as to only be left with models that were trained on data from news articles, you will be left with a much smaller set of available models. In that case, you may want to refer to the popularity of each model and start your exploration with the model that is used the most.
Other factors that may interest you could be publications about the model, the model’s developers, the company or university that released the model, the dataset that the model was trained on, the architecture the model was designed by, the evaluation metrics, and other potential factors that may be available on each model’s web page on Hugging Face’s website.
Now, we will review a code notebook that exemplifies implementing an open source LLM locally using Hugging Face’s free resources. We will continue with the same notebook from the previous section, Setting Up Close Source and Open Source LLMs:
Via pip on the Terminal, we will run the following:
pip install --upgrade transformers
Alternatively, if running directly from a Jupyter notebook, add ! to the beginning of the command.
hf_model = "microsoft/DialoGPT-medium"
max_length = 1000
tokenizer = AutoTokenizer.from_pretrained(hf_model)
model = AutoModelForCausalLM.from_pretrained(hf_model)
Note that this code requires access to the internet. Even though the model is deployed locally, an internet connection is required to import it. Again, if you wish, you can clone the model’s repo from Hugging Face and then no longer be required to have access to the internet.
microsoft/DialoGPT-medium's Response:
I think they would be more afraid of the humans
This section established the tremendous value proposition that LLMs can bring. We now have the necessary background to learn and explore a new frontier in efficient LLM application development – constructing pipelines using tools such as LangChain. Let’s dive into this advanced approach.
Retrieval-Augmented Generation (RAG) is a development framework designed for seamless interaction with LLMs. LLMs, by virtue of their generalist nature, are capable of performing a vast array of tasks competently. However, their generality often precludes them from delivering detailed, nuanced responses to queries that necessitate specialized knowledge or in-depth expertise in a domain. For instance, if you aspire to use an LLM to address queries concerning a specific discipline, such as law or medicine, it might satisfactorily answer general queries but fail to respond accurately to those needing detailed insights or up-to-date knowledge.
RAG designs offer a comprehensive solution to the limitations typically encountered in LLM processing. In a RAG framework, the text corpus undergoes initial preprocessing, where it’s segmented into summaries or distinct chunks and then embedded within a vector space. When a query is made, the model identifies the most relevant segments of this data and utilizes them to form a response. This process involves a combination of offline data preprocessing, online information retrieval, and the application of the LLM for response generation. It’s a versatile approach that can be adapted to a variety of tasks, including code generation and semantic search. RAG models function as an abstraction layer that orchestrates these processes. The efficacy of this method is continually increasing, with its applications expanding as LLMs evolve and require more contextually rich data during prompt processing. In Chapter 10, we will present a deeper discussion of RAG models and their role in the future of LLM solutions.
Now that we’ve introduced the premise and capabilities of RAG models, let’s focus on one particular example, called LangChain. We will review the nuts and bolts of its design principles and how it interfaces with data sources.
In this section, we will dissect the core methodologies and architectural decisions that make LangChain stand out. This will give us insights into its structural framework, the efficiency of data handling, and its innovative approach to integrating LLMs with various data sources.
One of the most significant virtues of LangChain is the ability to connect an arbitrary LLM to a defined data source. By arbitrary, we mean that it could be any off-the-shelf LLM that was designed and trained with no specific regard to the data we are looking to connect it to. Employing LangChain allows us to customize it to our domain. The data source is to be used for reference when structuring the answer to the user prompt. That data may be proprietary data owned by a company or local personal information on your personal machine.
However, when it comes to leveraging a given database, LangChain does more than point the LLM to the data; it employs a particular processing scheme and makes it quick and efficient. It creates a vector database.
Given raw text data, be it free text in a .txt file, formatted files, or other various data structures of text, a vector database is created by chunking the text into appropriate lengths and creating numerical text embeddings, using a designated model. Note that if the designated embedding model is chosen to be an LLM, it doesn’t have to be the same LLM that is used for prompting. For instance, the embedding model could be picked to be a free, sub-optimal, open source LLM, and the prompting model could be a paid LLM with optimal performance. Those embeddings are then stored in a vector database. You can clearly see that this approach is extremely storage-efficient, as we transform text, and perhaps encoded text, into a finite set of numerical values, which by its nature is dense.
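To make the chunking step concrete, here is a minimal sketch. The function and its parameters are our own illustration (LangChain ships its own text splitters); in a real pipeline, each resulting chunk would then be passed to the designated embedding model and the vectors stored in the vector database.

```python
def chunk_text(text, chunk_size=500, overlap=50):
    """Split raw text into overlapping chunks of roughly chunk_size characters.

    The overlap keeps sentences that straddle a chunk boundary visible in
    both neighboring chunks, so context is not lost at the cut points.
    """
    if chunk_size <= overlap:
        raise ValueError("chunk_size must be larger than overlap")
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk:
            chunks.append(chunk)
    return chunks
```

Production splitters usually cut on sentence or paragraph boundaries rather than raw character counts, but the windowing-with-overlap idea is the same.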
When a user enters a prompt, a search mechanism identifies the relevant data chunks in the embedded data source. The prompt is embedded with the same designated embedding model. The search mechanism then applies a similarity metric, such as cosine similarity, and finds the most similar text chunks in the defined data source. The original text of these chunks is retrieved, and the prompt is then sent again, this time to the prompting LLM. The difference is that, this time, the prompt consists of more than just the original user's prompt; it also contains the retrieved text, giving the LLM both the question and a rich text supplement to draw on as a reference.
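The similarity search just described can be sketched in plain Python. The toy vectors below stand in for real embeddings, and the function names are our own, not LangChain's:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def retrieve_top_k(query_vec, chunk_vecs, chunks, k=2):
    """Return the k chunks whose embeddings are most similar to the query."""
    scored = sorted(
        zip(chunks, chunk_vecs),
        key=lambda pair: cosine_similarity(query_vec, pair[1]),
        reverse=True,  # highest similarity first
    )
    return [chunk for chunk, _ in scored[:k]]
```

A real vector database replaces this linear scan with an approximate nearest-neighbor index, but the similarity metric and top-k retrieval are conceptually the same.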
If it weren’t for this design, when the user wanted to find an answer to their question, they would need to read through the vast material and find the relevant section. For instance, the material may be a company’s entire product methodology, consisting of many PDF documents. This process leverages an automated smart search mechanism that narrows the relevant material down to an amount of text that can fit into a prompt. Then, the LLM frames the answer to the question and presents it to the user immediately. If you wish, the pipeline can be designed to quote the original text that it used to frame the answer, thus allowing for transparency and verification.
This paradigm is portrayed in Figure 8.1:
Figure 8.1 – The paradigm of a typical LangChain pipeline
In order to explain the prompt engineering behind the LangChain pipeline, let's review a financial information use case. Your data source is a cohort of Securities and Exchange Commission (SEC) filings of public companies from the US. You are looking to identify companies that gave dividends to their stockholders, and in what year.
Your prompt would be as follows:
Which filings mention that the company gave dividends in the year 2023?
The pipeline then embeds this question and looks for text chunks with similar context (e.g., that discuss paid dividends). It identifies many such chunks, such as the following:
"Dividend Policy. Dividends are paid at the discretion of the Board of Directors. In fiscal 2023, we paid aggregate quarterly cash dividends of $8.79 per share […]"
The LangChain pipeline then forms a new prompt that includes the text of the identified chunks. In this example, we assume the prompted LLM is OpenAI’s GPT. LangChain embeds the information in the system prompt sent to OpenAI’s GPT model:
"prompts": [ "System: Use the following pieces of context to answer the user's question. \nIf you don't know the answer, just say that you don't know, don't try to make up an answer.\n----------------\n Dividend Policy. Dividends are paid at the […]" ]
As we can see, the system prompt is used to instruct the model how to act and then to provide the context.
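A rough sketch of how such a context-bearing prompt could be assembled is shown below. The helper is our own illustration that mimics the prompt text above, not LangChain's internal code:

```python
def build_rag_prompt(retrieved_chunks, user_question):
    """Assemble a system message carrying retrieved context, plus the user turn."""
    context = "\n".join(retrieved_chunks)
    system_content = (
        "Use the following pieces of context to answer the user's question.\n"
        "If you don't know the answer, just say that you don't know, "
        "don't try to make up an answer.\n"
        "----------------\n" + context
    )
    # The returned list matches the messages format used by chat-style APIs
    return [
        {"role": "system", "content": system_content},
        {"role": "user", "content": user_question},
    ]
```

The resulting messages list can be passed directly to a chat completion call, exactly as in the priming examples earlier in the chapter.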
Now that we have an understanding of the foundational approach and benefits of LangChain, let’s go deeper into its intricate design concepts, starting with how it bridges LLMs to diverse data sources efficiently.
While the preceding description is of data that is preprocessed to take the form of a vector database, another approach is to set up access to external data sources that are not yet processed into an embedding form. For instance, you may wish to leverage a SQL database to supplement other data sources. This approach is referred to as multiple retrieval sources.
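The multiple-retrieval-sources idea can be sketched as a query fanning out to both a vector-style store and a SQL database, with the hits merged into one context. This is a toy sketch: keyword overlap stands in for embedding similarity, and the table and column names are hypothetical.

```python
# Sketch of multiple retrieval sources: one query hits a vector-style store
# and a SQL database, and the results are merged into a single context.
import sqlite3

# Source 1: a toy "vector store" -- keyword overlap stands in for
# embedding similarity.
notes = ["Q3 revenue grew 12% year over year.",
         "The board approved a quarterly dividend."]

def vector_search(query: str, docs: list[str]) -> list[str]:
    q = set(query.lower().split())
    return [d for d in docs if q & set(d.lower().split())]

# Source 2: a SQL database queried directly, with no embedding step.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dividends (year INTEGER, amount REAL)")
conn.execute("INSERT INTO dividends VALUES (2023, 8.79)")

def sql_search(year: int) -> list[str]:
    rows = conn.execute(
        "SELECT year, amount FROM dividends WHERE year = ?", (year,))
    return [f"In {y}, dividends totaled ${a} per share." for y, a in rows]

context = vector_search("quarterly dividend", notes) + sql_search(2023)
print(context)
```

In a real pipeline, both result sets would be formatted into the prompt context before the LLM is called.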
We’ve now explored the ways LangChain efficiently interfaces with various data sources; now, it is essential to grasp the core structural elements that enable its functionalities – chains and agents.
The atomic building blocks within LangChain are called components. Typical components could be a prompt template, access to various data sources, and access to LLMs. When combining various components to form a system, we form a chain. A chain can represent a complete LLM-driven application.
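The component-and-chain idea can be illustrated with a minimal sketch. The LLM here is a stub function, and these are not LangChain’s actual classes, just the composition pattern they implement:

```python
# Minimal sketch of the component/chain idea: a prompt template component and
# an LLM component (stubbed here) are composed into a chain.
def prompt_template(topic: str) -> str:          # component 1
    return f"Write a one-line summary about {topic}."

def stub_llm(prompt: str) -> str:                # component 2 (stand-in LLM)
    return f"[LLM answer to: {prompt}]"

def chain(topic: str) -> str:                    # the chain composes them
    return stub_llm(prompt_template(topic))

print(chain("vector databases"))
```

Swapping the stub for a real LLM client, or adding a retrieval component in front of the template, extends the chain without changing its shape.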
We will now present the concept of agents and walk through a code example that showcases how chains and agents come together, creating a capability that would have been quite complex not too long ago.
The next layer of complexity over chains is agents. Agents leverage chains by employing them and complementing them with additional calculations and decisions. While a chain may yield a response to a simple request prompt, an agent would process the response and act upon it with further downstream processing based on a prescribed logic.
You can view agents as a reasoning mechanism that employs what we call a tool. Tools complement LLMs by connecting them with other data or functions.
Given the typical LLM shortcomings that prevent LLMs from being perfect multitaskers, agents employ tools in a prescribed and monitored manner, allowing them to retrieve necessary information, leverage it as context, and execute actions using designated existing solutions. Agents then observe the results and employ the prescribed logic for further downstream processes.
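The retrieve-observe-act loop can be sketched as a toy agent. Everything here is a stand-in: the search tool returns a canned observation (the page count matches the 224-minute result in the example below), the routing is hard-coded, and a real agent would let the LLM choose the tool at each step:

```python
# Toy sketch of an agent loop: the agent calls a tool, observes the result,
# and acts on it with a prescribed next step. Routing is hard-coded here;
# a real agent lets the LLM decide which tool to use.
def search_tool(query: str) -> str:
    return "Animal Farm has 112 pages."           # canned observation

def calculator_tool(expression: str) -> float:
    return eval(expression, {"__builtins__": {}})  # sketch only, not safe

tools = {"search": search_tool, "calculator": calculator_tool}

def run_agent(task: str) -> float:
    observation = tools["search"]("pages in Animal Farm")
    pages = int(observation.split()[3])           # parse the observation
    minutes = tools["calculator"](f"{pages} * 2")  # act on it
    return minutes

print(run_agent("How long to read Animal Farm at 2 min/page?"))
```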
As an example, assume we want to calculate the salary trajectory for an average entry-level programmer in our area. This task comprises three key sub-tasks – finding out what that average starting salary is, identifying the factors for salary growth (e.g., a change in the cost of living, or a typical merit increase), and then projecting onward. An ideal LLM would be able to do the entire process by itself, not requiring anything more than a coherent prompt. However, given the typical shortcomings, such as hallucinations and limited training data, current LLMs would not be able to perform this entire process to a level where it could be productionized within a commercial product. A best practice is to break it down and monitor the thought process via agents.
In its simplest design, this would require the following:
To exemplify the agentic approach, let's review a simple task that involves fetching a particular detail from the web, and using it to perform a calculation.
!pip install openai
!pip install wikipedia
!pip install langchain
!pip install langchain-openai
from langchain.agents import load_tools, initialize_agent
from langchain_openai import OpenAI
import os

os.environ["OPENAI_API_KEY"] = "<your API key>"

llm = OpenAI(model_name='gpt-3.5-turbo-instruct')
tools = load_tools(["wikipedia", "llm-math"], llm=llm)
agent = initialize_agent(tools, llm=llm, agent="zero-shot-react-description", verbose=True)
agent.run("Figure out how many pages are there in the book Animal Farm. Then calculate how many minutes would it take me to read it if it takes me two minutes to read one page.")
The output is then shown as follows:
> Finished chain. 'It would take me approximately 224 minutes or 3 hours and 44 minutes to read Animal Farm.'
Note that we didn’t apply any method to make the LLM’s output deterministic (such as setting the temperature to 0), so there is no guarantee of reproducing this exact response. Running this code again will yield a slightly different answer.
In the next chapter, we will dive deeper into several examples with code. In particular, we will program a multi-agent framework, where a team of agents is working on a joint project.
Another very important concept is long-term memory. We discussed how LangChain complements an LLM’s knowledge by appending additional data sources, some of which may be proprietary, making it highly customized for a particular use case. However, it still lacks a very important function, the ability to refer to prior conversations and learn from them. For instance, you can design an assistant for a project manager. As the user interacts with it, they would ideally update each day about the progress of the work, the interactions, the challenges, and so on. It would be best if the assistant could digest all that newly accumulated knowledge and sustain it. That would allow for a scenario such as this:
We will touch more on the concept of memory in the next chapter.
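As a preview, the core of conversational memory can be sketched in a few lines: every exchange is appended to a log, and the recent history is prepended to the next prompt so the assistant can refer back to prior conversations. The class and method names here are illustrative, not a real LangChain memory API:

```python
# Minimal sketch of long-term memory: exchanges accumulate in a log, and the
# recent history is folded into each new prompt.
class MemoryAssistant:
    def __init__(self, max_turns: int = 6):
        self.history: list[str] = []
        self.max_turns = max_turns

    def build_prompt(self, user_msg: str) -> str:
        recent = "\n".join(self.history[-self.max_turns:])
        return f"Previous conversation:\n{recent}\nUser: {user_msg}"

    def chat(self, user_msg: str, reply: str) -> str:
        prompt = self.build_prompt(user_msg)      # would be sent to the LLM
        self.history.append(f"User: {user_msg}")
        self.history.append(f"Assistant: {reply}")
        return prompt

pm = MemoryAssistant()
pm.chat("Sprint 4 started today.", "Noted.")
print(pm.build_prompt("What did I tell you yesterday?"))
```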
To maintain the accuracy and relevance of LLM outputs in dynamic information environments, it’s imperative to implement strategies for the ongoing update and maintenance of vector databases. As the corpus of knowledge continues to expand and evolve, so too must the embeddings that serve as the foundation for LLM responses. Incorporating techniques for incremental updates allows these databases to refresh their embeddings as new information becomes available, ensuring that the LLMs can provide the most accurate and up-to-date responses.
Incremental updates involve periodically re-embedding existing data sources with the latest information. This process can be automated to scan for updates in the data source, re-embed the new or updated content, and then integrate these refreshed embeddings into the existing vector database, without the need for a complete overhaul. By doing so, we ensure that the database reflects the most current knowledge available, enhancing the LLM’s ability to deliver relevant and nuanced responses.
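One common way to implement this, sketched below, is to hash each document’s content and re-embed only documents whose hash has changed since the last run. The `embed()` function is a stand-in for a real embedding model:

```python
# Sketch of incremental updates: hash each document's content and re-embed
# only new or changed documents. embed() stands in for a real model.
import hashlib

def embed(text: str) -> list[float]:
    return [float(len(text))]                     # stand-in embedding

def content_hash(text: str) -> str:
    return hashlib.sha256(text.encode()).hexdigest()

def incremental_update(docs: dict[str, str], store: dict[str, dict]) -> int:
    """Refresh embeddings for new or changed docs; return how many changed."""
    changed = 0
    for doc_id, text in docs.items():
        h = content_hash(text)
        if doc_id not in store or store[doc_id]["hash"] != h:
            store[doc_id] = {"hash": h, "embedding": embed(text)}
            changed += 1
    return changed

store: dict[str, dict] = {}
print(incremental_update({"a": "v1", "b": "hello"}, store))  # first run: 2
print(incremental_update({"a": "v2", "b": "hello"}, store))  # only "a": 1
```

Unchanged documents keep their existing embeddings, which is what avoids the complete overhaul described above.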
Automated monitoring plays a pivotal role in this ecosystem by continually assessing the quality and relevance of the LLM’s outputs. This involves setting up systems that track the performance of the LLM, identifying areas where responses may be falling short due to outdated information or missing contexts. When such gaps are identified, the monitoring system can trigger an incremental update process, ensuring that the database remains a robust and accurate reflection of the current knowledge landscape.
By embracing these strategies, we ensure that LangChain and similar RAG frameworks can sustain their effectiveness over time. This approach not only enhances the relevance of LLM applications but also ensures that they can adapt to the rapidly evolving landscape of information, maintaining their position at the forefront of NLP technology.
We can now get hands-on with LangChain.
We are now ready to set up a complete pipeline that can later be lent to various NLP applications.
Refer to the Ch8_Setting_Up_LangChain_Configurations_and_Pipeline.ipynb notebook. This notebook implements the LangChain framework. We will walk through it step by step, explaining the different building blocks. We chose a simple use case here, as the main point of this code is to show how to set up a LangChain pipeline.
In this scenario, we are in the healthcare sector. We have many care givers; each has many patients they may see. The physician in chief made a request on behalf of all the physicians in the hospital to be able to use a smart search across their notes. They heard about the new emerging capabilities with LLMs, and they would like to have a tool where they can search within the medical reports they wrote.
For instance, one physician said the following:
“I often come across research that may be relevant to a patient I saw months ago, but I don’t recall who that was. I would like to have a tool where I can ask, ‘Who was that patient that complained about ear pain and had a family history of migraines?’, and it would find me that patient.”
Thus, the business objective here is as follows:
“The CTO tasked us with putting together a quick prototype in the form of a Jupyter notebook. We will collect several clinical reports from the hospital’s database, and we will use LangChain to search through them in the manner that the physician in the example described.”
Let’s jump right in by designing the solution in Python.
Diving into the practicalities of LangChain, this section will guide you step by step in setting up a LangChain pipeline using Python, from installing the necessary libraries to executing sophisticated similarity searches.
As always, we have a list of libraries that we will need to install. Since we are writing the code in a Jupyter notebook, we can install them from within the code:
As we can see, the similarity search function is able to do a good job with most of the questions. It embeds the question and looks for reports whose embeddings are similar.
However, a similarity search could only go so far when it comes to answering the question correctly. It is easy to think of a question that discusses a matter that is very similar to one of the notes, yet a minor difference confuses the similarity search mechanism. For instance, the similarity search process actually makes a mistake in question two, mistaking different months and, thus, providing a wrong answer.
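This failure mode is easy to reproduce with a toy example. Below, Jaccard token overlap stands in for embedding cosine similarity: a note about an August due date is lexically almost identical to a question about September, so a pure similarity score ranks it first despite the wrong month:

```python
# Toy illustration of the failure mode: lexical/semantic overlap dominates,
# so a note with the wrong month still ranks first. Jaccard overlap stands
# in for embedding cosine similarity.
def similarity(a: str, b: str) -> float:
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb)

question = "pregnant patient due to deliver in september"
notes = {
    "note_aug": "pregnant patient due to deliver in august",
    "note_ear": "patient complained about ear pain",
}
ranked = sorted(notes, key=lambda n: similarity(question, notes[n]),
                reverse=True)
print(ranked[0])   # the August note wins despite the wrong month
```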
In order to overcome this matter, we would want to do more than just a similarity search. We would want an LLM to review the results of the similarity search and apply its judgment. We will see how that’s done in the next chapter.
With our foundation set for LangChain’s practical applications in Python, let’s now move on to understanding how the cloud plays a pivotal role, especially when harnessing the true potential of LLMs in contemporary computational paradigms.
In this era of big data and computation, cloud platforms have emerged as vital tools for managing large-scale computations, providing infrastructure, storage, and services that can be rapidly provisioned and released with minimal management effort.
This section will focus on computation environments in the cloud. These have become the dominant choice for many leading companies and institutions. As an organization, having a computation environment in the cloud versus on-premises makes a major difference. It impacts the ability to share resources and manage allocations, maintenance, and cost. There are many trade-offs for employing cloud services instead of owning physical machines. You can learn about them by searching online or even asking a chat LLM about them.
One significant difference with cloud computing is the ecosystem that the providers have built around it. When you pick a cloud provider as your computation hub, you tap into a whole suite of additional products and services, opening up a new world of capabilities that would not be as accessible to you otherwise.
In this section, we will focus on the LLM aspect of those services.
The three primary cloud platforms are AWS, Microsoft Azure, and GCP. These platforms offer a myriad of services, catering to the varying needs of businesses and developers. When it comes to NLP and LLMs, each platform provides dedicated resources and services to facilitate experimentation, deployment, and production.
Let’s explore each of these platforms to see how they cater to our specific needs.
AWS remains a dominant force in the cloud computing landscape, providing a comprehensive and evolving suite of services that cater to the needs of ML and AI development. AWS is renowned for its robust infrastructure, extensive service offerings, and deep integration with ML tools and frameworks, making it a preferred platform for developers and data scientists looking to innovate with LLMs.
AWS provides a rich ecosystem of tools and services designed to facilitate the development and experimentation with LLMs, ensuring that researchers and developers have access to the most advanced ML capabilities:
AWS provides a suite of services designed to efficiently deploy and manage LLMs at scale, ensuring that models are easily accessible and performant under varying loads:
Let’s move on to the next topic, Microsoft Azure.
Microsoft Azure stands at the forefront of cloud computing services, offering a robust platform for the development, deployment, and management of ML and LLMs. Leveraging its strategic partnership with OpenAI, Azure provides exclusive cloud access to GPT models, positioning itself as a critical resource for developers and data scientists aiming to harness the power of advanced NLP technologies. Recent enhancements have expanded Azure’s capabilities, making it an even more attractive choice for those looking to push the boundaries of AI and ML applications.
Azure has significantly enriched its offerings to support research and experimentation with LLMs, providing a variety of tools and platforms that cater to the diverse needs of the AI development community:
Azure’s infrastructure and services offer comprehensive solutions for the deployment and productionization of LLM applications, ensuring scalability, performance, and security:
GCP continues to be a powerhouse in cloud computing, providing an extensive suite of services that cater to the evolving needs of AI and ML development. Known for its cutting-edge innovations in AI and ML, GCP offers a rich ecosystem of tools and services that facilitate the development, deployment, and scaling of LLMs, making it an ideal platform for developers and researchers aiming to leverage the latest in AI technology.
GCP has further enhanced its capabilities for experimenting with and developing LLMs, offering a comprehensive set of tools that support the entire ML workflow, from data ingestion and model training to hyperparameter tuning and evaluation:
GCP provides robust and scalable solutions for deploying and productionizing LLMs, ensuring that applications built on its platform can meet the demands of real-world usage:
The landscape of cloud computing continues to evolve rapidly, with AWS, Azure, and GCP each offering unique advantages for the development and deployment of LLMs. AWS stands out for its broad infrastructure and deep integration with ML tools, making it ideal for a wide range of ML and AI projects. Azure, with its exclusive access to OpenAI’s models and deep integration within the Microsoft ecosystem, offers unparalleled opportunities for enterprises looking to leverage the cutting edge of AI technology. GCP, recognized for its innovation in AI and ML, provides tools and services that mirror Google’s internal AI advancements, appealing to those seeking the latest in AI research and development. As the capabilities of these platforms continue to expand, the choice between them will increasingly depend on specific project needs, organizational alignment, and strategic partnerships, underscoring the importance of a thoughtful evaluation based on the current and future landscape of cloud-based AI and ML.
As the world of NLP and LLMs continues to grow rapidly, so do the various practices of system design. In this chapter, we reviewed the design process of LLM applications and pipelines. We discussed the components of these approaches, touching on both API-based closed source and local open source solutions. We then gave you hands-on experience with code.
We later delved deeper into the system design process and introduced LangChain. We reviewed what LangChain comprises and experimented with an example pipeline in code.
To complement the system design process, we surveyed leading cloud services that allow you to experiment, develop, and deploy LLM-based solutions.
In the next chapter, we’ll focus on particular practical use cases, accompanied with code.
In the rapidly evolving landscape of natural language processing (NLP), large language models (LLMs) have marked a revolutionary step forward, reshaping how we interact with information, automate processes, and derive insights from vast data pools. This chapter represents the culmination of our journey through the emergence and development of NLP methods. It is here that the theoretical foundations laid in previous chapters converge with practical, cutting-edge applications, illuminating the remarkable capabilities of LLMs when harnessed with the right tools and techniques.
We delve into the most recent and thrilling advancements in LLM applications, presented through detailed Python code examples designed for hands-on learning. This approach not only illustrates the power of LLMs but also equips you with the skills to implement these technologies in real-world scenarios. The subjects covered in this chapter are meticulously selected to showcase a spectrum of advanced functionalities and applications.
The importance of this chapter cannot be overstated. It not only reflects the state of the art in NLP but also serves as a bridge to the future, where the integration of these technologies into everyday solutions becomes seamless. By the end of this chapter, you will have a comprehensive understanding of how to apply the latest LLM techniques and innovations, empowering you to push the boundaries of what’s possible in NLP and beyond. Join us on this exciting journey to unlock the full potential of LLMs.
Let’s go through the main headings covered in the chapter:
For this chapter, the following will be necessary:
Now that we’ve set up LLM applications using APIs and locally, we can finally deploy the advanced applications of LLMs that let us leverage their immense power.
The retrieval-augmented generation (RAG) framework has become instrumental in tailoring large language models (LLMs) for specific domains or tasks, bridging the gap between the simplicity of prompt engineering and the complexity of model fine-tuning.
Prompt engineering stands as the initial, most accessible technique for customizing LLMs. It leverages the model’s capacity to interpret and respond to queries based on the input prompt. For example, to inquire if Nvidia surpassed earnings expectations in its latest announcement, directly providing the earnings call content within the prompt can compensate for the LLM’s lack of immediate, up-to-date context. This approach, while straightforward, hinges on the model’s ability to digest and analyze the provided information within a single or a series of carefully crafted prompts.
When the scope of inquiry exceeds what prompt engineering can accommodate—such as analyzing a decade’s worth of tech sector earnings calls—RAG becomes indispensable. Prior to RAG’s adoption, the alternative was fine-tuning, a resource-intensive process requiring significant adjustments to the LLM’s architecture to incorporate extensive datasets. RAG simplifies this by preprocessing and storing large amounts of data in a vector database. It intelligently isolates and retrieves the data segments pertinent to the query, effectively condensing the vast information into a manageable, prompt-size context for the LLM. This innovation drastically reduces the time, resources, and expertise needed for such extensive data familiarization tasks.
In Chapter 8, we introduced the general concept of RAGs and, in particular, LangChain, a RAG framework distinguished by its advanced capabilities.
We will now discuss the additional unique features LangChain offers for enhancing LLM applications, providing you with practical insights into its implementation and utility in complex NLP tasks.
In this section, we will pick up where we left off with our last example from Chapter 8. In this scenario, we are in the healthcare sector, and in our hospital, our care providers are expressing a need to be able to quickly surface patients’ records based on rough descriptions of the patient or their condition. For example, “Who was that patient I saw last year who was pregnant with triplets?” “Did I ever have a patient with a history of cancer from both of their parents and they were interested in a clinical trial?” and so on.
Important note
We stress that these aren’t real medical notes and that the people described in the notes aren’t real.
In our example in Chapter 8, we kept the pipeline at minimum complexity by simply leveraging the vector databases of embeddings of clinical notes, and then we applied similarity search to look for notes based on simple requests. We noticed how one of the questions, the second question, received a wrong answer with the similarity search algorithm.
We will now enhance that pipeline. We will not settle for the results of the similarity search and surface those to the physicians; we will take those results that were deemed to be similar in content to the request, and we will employ an LLM to go through these results, vet them, and tell us which ones are indeed relevant to the physician.
We'll use this pipeline to exemplify the utility of either type of LLM, paid or free. We give you the choice, via the paid_vs_free variable, to either use OpenAI’s paid GPT model or a free LLM. Using OpenAI’s paid model would leverage their API and would require an API key. However, the free LLM is imported to the local environment where the Python code is run, thus making it available to anyone who has an internet connection and sufficient computational resources.
Let’s start getting hands-on and experimenting with the code.
Refer to the following notebook: Ch9_Advanced_LangChain_Configurations_and_Pipeline.ipynb.
Note that the first part of the notebook is identical to the notebook from Chapter 8, so we will skip the description of that part.
Here, we need to expand the set of installed libraries and install openai and gpt4all. Moreover, in order to utilize gpt4all, we will need to download a .bin file from the web.
These two steps are easy to perform via the notebook.
As explained above, we let you choose whether you want to run this example via a paid API by OpenAI or a free LLM.
Remember, since OpenAI’s service includes hosting the LLM and processing the prompts, it requires minimal resources and time and a basic internet connection. It also involves sending our prompts to OpenAI’s API service. Prompts typically include information that, in real-world settings, may be proprietary. Thus, an executive decision needs to be made regarding the security of the data. Similar considerations were central, in the last decade, to the transition of companies’ computation from on-premises to the cloud.
In contrast to that requirement, with a free LLM, you would host it locally, you would avoid exporting any information outside of your computation environment, but you would take on the processing.
Another aspect to consider is the terms of use of each LLM, as each may have different license terms. While an LLM may allow you to experiment with it for free, it may place restrictions on whether you may use it in a commercial product.
In the context of constraints around runtime and computational resources, choosing the paid LLM for this example will yield quicker responses.
In order to accommodate your wish to experiment with a free LLM, and since we aspire to let you run the code quickly and for free on Google Colab, we must restrict our choice of LLMs to those that can be run on the limited RAM that Google lets us have with a free account. In order to do that, we chose an LLM with reduced precision, also known as a quantized LLM.
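The RAM savings from reduced precision are easy to estimate with back-of-envelope arithmetic. The 7B parameter count below is an illustrative figure, not the exact size of the model used in the notebook:

```python
# Back-of-envelope memory footprint of a 7B-parameter model at different
# numeric precisions -- the reason a quantized LLM fits in limited RAM.
def model_gb(n_params: float, bits_per_weight: int) -> float:
    return n_params * bits_per_weight / 8 / 1e9

params = 7e9
print(f"fp16: {model_gb(params, 16):.1f} GB")   # ~14.0 GB
print(f"int8: {model_gb(params, 8):.1f} GB")    # ~7.0 GB
print(f"4-bit: {model_gb(params, 4):.1f} GB")   # ~3.5 GB
```

At 4 bits per weight, the same model needs roughly a quarter of the fp16 memory, at the cost of some precision, which explains the quality trade-off noted later in this example.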
Based on your choice between an API-based LLM and a free local LLM, the LLM will be assigned to the llm variable.
Here, we set up a RAG framework. It is designed to accept various text documents and set them up for retrieval.
We will now run the exact same requests as we did in the example in Chapter 8. Those will be performed across the same notes, and the same vector DB that holds the same embedding. None of that has been changed or enhanced. The difference is that we will have the LLM oversee the processing of the answers.
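The enhanced pipeline can be sketched as two stages: similarity search proposes candidates, and an LLM pass vets them. In this sketch, keyword overlap stands in for the vector search, and `vet()` is a stand-in for the real LLM judgment call (here it simply demands that the month match):

```python
# Sketch of the enhanced pipeline: similarity search proposes candidates,
# then an LLM pass vets them. vet() stands in for the real LLM call.
def retrieve_candidates(question: str, notes: list[str]) -> list[str]:
    q = set(question.lower().split())
    scored = sorted(notes, key=lambda n: len(q & set(n.lower().split())),
                    reverse=True)
    return scored[:2]                             # top-k by crude similarity

def vet(question: str, note: str) -> bool:
    # Stand-in for the LLM check; here we require the exact month to match.
    return "september" in note.lower()

question = "Are there any pregnant patients who are due to deliver in September?"
notes = ["Patient is pregnant, due to deliver in August.",
         "Patient reports ear pain; family history of migraines."]
candidates = retrieve_candidates(question, notes)
relevant = [n for n in candidates if vet(question, n)]
print(relevant)   # [] -- the vetting pass rejects the August note
```

The similarity stage still surfaces the August note, but the vetting stage filters it out, which is exactly the correction we are about to observe with the real LLM.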
In Chapter 8, we saw that question number two received a wrong answer. The question was, “Are there any pregnant patients who are due to deliver in September?”
The answer we saw in Chapter 8 was about a patient who is due to give birth in August. The mistake was due to the deficiency of the similarity algorithm. Indeed, that patient’s notes had content similar to that of the question, but the fine detail of giving birth in a different month should have been the factor that made those notes irrelevant.
Here, in our current pipeline, where OpenAI’s LLM is applied, it gets it right, telling us that there are no patients who are due to deliver in September.
Note that when opting for the free LLM, it gets it wrong. This exemplifies the sub-optimal aspects of that model, as it is quantized in an effort to save on RAM requirements.
To conclude this example, we have put together an in-house search mechanism that lets the user, in our example, a physician, search through their patients’ notes to find patients based on some criteria. A unique aspect of this system design is the ability to let the LLM retrieve the relevant answer from an external data source and not be limited to the data it was trained on. This paradigm is the basis of RAG.
In the next section, we will showcase more uses for LLMs.
In this section, we will continue our exploration of ways one can utilize LLM pipelines. We will focus on chains.
Refer to the following notebook: Ch9_Advanced_Methods_with_Chains.ipynb. This notebook presents an evolution of a chain pipeline, as every iteration exemplifies another feature that LangChain allows us to employ.
For the sake of using minimal computational resources, memory, and time, we use OpenAI’s API. You can choose to use a free LLM instead and may do so in a similar way to how we set up the notebook from the previous example in this chapter.
The notebook starts with the basic configurations, as always, so we can skip to reviewing the notebook’s content.
In this example, we want to use the LLM to tell us an answer to a simple question that would require common knowledge that a trained LLM is expected to have:
"Who are the members of Metallica. List them as comma separated."
We then define a simple chain called LLMChain, and we feed it with the LLM variable and the prompt.
The LLM, indeed, knows the answer from its knowledge base and returns:
'James Hetfield, Lars Ulrich, Kirk Hammett, Robert Trujillo'
This time, we would like the output to be in a particular syntax, potentially allowing us to use it in a computational manner for downstream tasks:
"List the first 10 elements from the periodical table as comma separated list." 现在,我们添加一个用于实现语法的功能。我们定义了output_parser 变量,并使用不同的函数来生成输出,predict_and_parse()。
Now, we add a feature for achieving the syntax. We define the output_parser variable, and we use a different function for generating the output, predict_and_parse().
The output is the following:
['Hydrogen',
'Helium',
'Lithium',
'Beryllium',
'Boron',
'Carbon',
'Nitrogen',
'Oxygen',
'Fluorine',
'Neon']
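The kind of parsing that predict_and_parse() applies can be approximated with a small standalone function — a sketch of the idea, not LangChain's implementation:

```python
def parse_comma_separated(text: str) -> list[str]:
    """Split an LLM's comma-separated answer into a clean Python list."""
    return [item.strip() for item in text.split(",") if item.strip()]

raw = "Hydrogen, Helium, Lithium, Beryllium, Boron"
print(parse_comma_separated(raw))
# → ['Hydrogen', 'Helium', 'Lithium', 'Beryllium', 'Boron']
```

Turning the model's free-text answer into a native list is what makes the output usable for downstream computation.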
This feature brings a new level of value to the chain. Until this point, the prompts didn’t have any context. The LLM processed each prompt independently. For instance, if you wanted to ask a follow-up question, you couldn’t. The pipeline didn’t have your prior prompts and the responses to them as reference.
In order to go from asking disjointed questions to having an ongoing, rolling conversation-like experience, LangChain offers ConversationChain(). Within this function, we have a memory parameter that maps the prior interactions with the chain to the current prompt. Therefore, the prompt template is where that memory “lives.”
Instead of prompting with a basic template, such as
"List all the holidays you know as comma separated list." 模板现在可容纳内存功能:
the template now accommodates the memory feature:
"Current conversation:
{history}
Your task:
{input}}" 在这里,您可以认为该字符串的格式类似于 Python f"…"字符串,其中历史记录和输入是字符串变量。ConversationChain ()函数处理此提示模板并插入这两个变量以完成提示字符串。当我们激活内存机制时,输入变量是由函数本身生成的,然后当我们运行以下命令时,输入变量由我们提供:
Here, you can think of this string as being formatted similarly to a Python f"…" string, where history and input are string variables. The ConversationChain() function processes this prompt template and inserts these two variables to complete the prompt string. The history variable is produced by the function itself as we activate the memory mechanism, and the input variable is then supplied by us as we run the following:
conversation.predict_and_parse(input="写下您知道的前 10 个假期,以逗号分隔的列表形式。")
conversation.predict_and_parse(input="Write the first 10 holidays you know, as a comma separated list.") 其中输出如下:
The output is the following:
['Christmas', 'Thanksgiving', "New Year's Day", 'Halloween', 'Easter', 'Independence Day', "Valentine's Day", "St. Patrick's Day", 'Labor Day', 'Memorial Day']
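The template-filling mechanics can be illustrated with plain Python string formatting. This is a sketch — the history text below is a hypothetical prior exchange, and LangChain manages this substitution internally:

```python
# The prompt template, with slots for the memory and the new request.
template = "Current conversation:\n{history}\nYour task:\n{input}"

# Hypothetical prior exchange that the memory component would supply.
history = ("Human: Write the first 10 holidays you know, as a comma separated list.\n"
           "AI: ['Christmas', 'Thanksgiving', 'Halloween']")
new_input = "Remove all the non-religious holidays from the list."

# The chain fills both slots to build the full prompt sent to the LLM.
prompt = template.format(history=history, input=new_input)
print(prompt)
```

Because the prior exchange is injected into the prompt itself, the LLM can resolve references such as "the list you printed" without any special memory inside the model.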
Now, let’s make a follow-up request that would only be understood in the context of the previous request and output:
conversation.predict_and_parse(input=" Observe the list of holidays you printed and remove all the non-religious holidays from the list.") 事实上,我们得到了适当的输出:
Indeed, we get the appropriate output:
['Christmas', 'Thanksgiving', "New Year's Day", 'Easter', "Valentine's Day", "St. Patrick's Day,"]
To complete this example, let’s assume the intention we had was to quickly generate a table of some holidays that includes their names and descriptions:
"For each of these, tell about the holiday in 2 sentences. Form the output in a json format table. The table's name is "holidays" and the fields are "name" and "description". For each row, the "name" is the holiday's name, and the "description" is the description you generated. The syntax of the output should be a json format, without newline characters."
Now, we get a formatted string from the chain:
{ "holidays": [ { "name": "Christmas", "description": "Christmas is a religious holiday that celebrates the birth of Jesus Christ and is widely observed as a secular cultural and commercial phenomenon." }, { "name": "Thanksgiving", "description": "Thanksgiving is a national holiday in the United States, celebrated on the fourth Thursday of November, and originated as a harvest festival." }, { "name": "Easter", "description": "Easter is […]
We can then use pandas to convert this string to a table:
dict = json.loads(output)
pd.json_normalize(dict[ "holidays"]) pandas 将dict处理为 DataFrame 后,我们可以在表 9.1中观察到:
After pandas processes dict to be a DataFrame, we can observe it in Table 9.1:
Table 9.1 – pandas transformed the table from dict to a DataFrame, thus suiting downstream processing
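A self-contained version of that conversion, with a shortened JSON string standing in for the chain's actual output, might look like this:

```python
import json
import pandas as pd

# A shortened stand-in for the JSON string returned by the chain.
output = ('{"holidays": ['
          '{"name": "Christmas", "description": "A religious holiday."},'
          '{"name": "Easter", "description": "A religious holiday in spring."}]}')

data = json.loads(output)                 # `data` avoids shadowing the built-in dict
df = pd.json_normalize(data["holidays"])  # one row per holiday
print(df)
```

With the answer in a DataFrame, any downstream step (filtering, joining, exporting) becomes ordinary pandas work rather than string manipulation.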
This concludes the various chain features that this notebook presents. Notice how we leveraged both the features that chains bring us and those that the LLM brings us. For instance, while the memory and parsing features are completely handled on the chain’s side, the ability to present a response in a particular format, such as a JSON format, is solely attributable to the LLM.
In our next example, we will continue to present novel utilities with LLMs and LangChain.
In this example, we will review how simple it is to leverage LLMs to access the web and extract information. We may wish to research a particular topic, and so we would like to consolidate all the information from a few web pages, several YouTube videos that present that topic, and so on. Such an endeavor can take a while, as the content may be massive. For instance, several YouTube videos can sometimes take hours to review. Often, one doesn’t know how useful the video is until one has watched a significant portion of it.
Another use case is when looking to track various trends in real time. This may include tracking news sources, YouTube videos, and so on. Here, speed is crucial. Unlike the previous example where speed was important to save us personal time, here, speed is necessary for getting our algorithm to be relevant for identifying real-time emerging trends.
In this section, we put together a very simple and limited example.
Refer to the following notebook: Ch9_Retrieve_Content_from_a_YouTube_Video_and_Summarize.ipynb. We will build our application on a library called EmbedChain (https://github.com/embedchain/embedchain). EmbedChain leverages a RAG framework and enhances it by allowing the vector database to include information from various web sources.
In our example, we will choose a particular YouTube video (Robert Waldinger: What makes a good life? Lessons from the longest study on happiness | TED: https://www.youtube.com/watch?v=8KkKuTCFvzI&ab_channel=TED). We would like the content of that video to be processed into the RAG framework. Then, we will prompt an LLM with questions and tasks related to the content of that video, thus allowing us to extract everything we care to learn about the video without having to watch it.
It should be stressed that a key feature that this method relies on is that YouTube accompanies many of its verbal videos with a written transcript. This makes the importing of the video’s text content seamless. If, however, one wishes to apply this method to a video that isn’t accompanied by a transcript, this is not a problem. One would need to pick a speech-to-text model, many of which are free and of very high quality. The audio of the video would be processed, a transcript would be extracted, and you may then import it into the RAG process.
As with previous notebooks, here too, we install the necessary packages, import all the relevant packages, and set our OpenAI API key.
We then do the following:
We need to set EmbedChain’s RAG process. We specify that we are passing a path to a YouTube video, and we provide the video’s URL.
We can then print out the text that was fetched and verify that it is, indeed, aligned with the video we are looking to analyze.
We will now observe the value that this code yields.
We ask the LLM to review the content, to put together a summary, and to present that summary in English, Russian, and German:
Please review the entire content, summarize it to the length of 4 sentence, then translate it to Russian and to German. Make sure the summary is consistent with the content. Put the string '\n----\n' between the English part of the answer and the Russian part. Put the string '\n****\n' between the Russian part of the answer and the German part.
The returned output is spot on, as it completely captures the essence of the TED talk. We edit it to remove the delimiter strings and get:
The content emphasizes the importance of good relationships in keeping us happy and healthy throughout our lives. It discusses how social connections, quality of close relationships, and avoiding conflict play crucial roles in our well-being. The study follows the lives of 724 men over 75 years, highlighting the significance of relationships over wealth and fame in leading a fulfilling life. Russian: Содержание подчеркивает Важность [...] German: Der Inhalt betont die Bedeutung [...]
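Because the prompt fixed the delimiter strings, the three language sections can be separated programmatically. Here is a sketch, with placeholder text standing in for the model's actual answer:

```python
# Placeholder text stands in for the model's actual trilingual answer.
response = ("An English summary of the talk."
            "\n----\n"
            "Русское резюме доклада."
            "\n****\n"
            "Eine deutsche Zusammenfassung des Vortrags.")

# Split on the delimiters we requested in the prompt.
english, rest = response.split("\n----\n", 1)
russian, german = rest.split("\n****\n", 1)

print(english)
```

Requesting machine-friendly delimiters in the prompt is a simple trick that turns free-form LLM output into structured fields a program can route.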
Now, to make the content simple for, say, a German speaker, we ask the LLM to form the German summary into several bullet points that best describe the content of the video.
It does this well, and the outputs are as follows:
- Betonung der Bedeutung guter Beziehungen für Glück und Gesundheit
- Diskussion über soziale Verbindungen, Qualität enger Beziehungen und Konfliktvermeidung
- Verfolgung des Lebens von 724 Männern über 75 Jahre in der Studie
- Hervorhebung der Bedeutung von Beziehungen im Vergleich zu Reichtum und Ruhm
- Fokus auf Beziehungen als Schlüssel zu einem erfüllten Leben
While this code is meant to serve as a basic proof of concept, one can see how simple it would be to add more data sources, automate it to run constantly, and act based on the findings. While a readable summary is helpful, one could change the code to act based on the identified content and execute downstream applications.
Now that we have observed several capabilities that LLMs can perform, we can take a step back and refine the way we utilize those LLMs. In our next section, we will exemplify how one may reduce LLM processing, thus saving API costs, or, when employing a local LLM, reducing inference computation.
This part is dedicated to a recent development in resource optimization for when employing API-based LLMs, such as OpenAI’s services. When considering the many trade-offs between employing a remote LLM as a service and hosting an LLM locally, one key metric is cost. In particular, based on the application and usage, the API costs can accumulate to a significant amount. API costs are mainly driven by the number of tokens that are being sent to and from the LLM service.
In order to illustrate the significance of this payment model on a business plan, consider business units for which the product or service relies on API calls to OpenAI’s GPT, where OpenAI serves as a third-party vendor. As a particular example, imagine a social network that lets its users have LLM assistance to comment on posts. In that use case, a user is interested in commenting on a post, and instead of having to write a complete comment, a feature lets the user describe their feelings about the post in three–five words, and a backend process augments a full comment.
In this particular example, the engine collects the user’s three–five words, and it also collects the content of the post that the comment is meant for, meaning it will also collect all other relevant information that the social network’s experts would think is relevant for augmenting a comment. For instance, the user’s profile description, their past few comments, and so on.
This would mean that every time a user wishes to have a comment augmented, a detailed prompt is sent from the social network’s servers to the third party’s LLM via the API.
Now, this type of process can accumulate high costs.
In this section, we will analyze an approach to reducing this cost by reducing the number of tokens sent to the LLM through the API. The basic assumption is that one can always reduce the number of words sent to the LLM and, thus, reduce cost, but the reduction in performance could be significant. Our motivation is to reduce that amount while maintaining high-quality performance. We then ask whether only the “right” words could be sent, ignoring other “non-material” words. This notion reminds us of the concept of file compression, where a smart and tailored algorithm is employed to reduce the size of a file while maintaining its purpose and value.
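To make the economics concrete, here is a simple cost sketch. The request volume and per-token price below are hypothetical placeholders, not any vendor's actual rates:

```python
def monthly_api_cost(requests_per_day: int, tokens_per_request: int,
                     price_per_1k_tokens: float, days: int = 30) -> float:
    """Estimated monthly spend on prompt tokens alone."""
    total_tokens = requests_per_day * tokens_per_request * days
    return total_tokens / 1000 * price_per_1k_tokens

# Hypothetical service: 100k augmentation requests/day, 1,500 prompt
# tokens each, at a made-up price of $0.001 per 1k tokens.
baseline = monthly_api_cost(100_000, 1_500, 0.001)
compressed = monthly_api_cost(100_000, 300, 0.001)  # ~5x prompt compression

print(f"baseline ${baseline:,.0f}/mo vs compressed ${compressed:,.0f}/mo")
```

Even with these toy numbers, a 5x reduction in prompt tokens cuts the monthly token bill by the same factor, which is why prompt compression is attractive at scale.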
Here, we introduce LLMLingua, a development by Microsoft that is meant to address prompts that are “sparse” in information by compressing them.
LLMLingua utilizes a compact, well-trained language model, such as LLaMA-7B, to identify and remove non-essential tokens within prompts. This approach enables efficient inference with LLMs, achieving up to 20x compression with minimal performance loss (https://github.com/microsoft/LLMLingua).
In their papers (https://arxiv.org/abs/2310.05736 and https://arxiv.org/abs/2310.06839), the authors explain the algorithm and the advantages it proposes. It is interesting to note that besides the reduction in cost, the compression also aims to focus the remaining content, which is shown by the authors to lead to an improvement in performance by the LLM, as it avoids a sparse and noisy prompt.
Let’s experiment with prompt compression in a real-world example and evaluate its impact and various trade-offs.
For the sake of this experiment, we'll illustrate a real-world example.
In our current use case, we are developing a feature that sits on top of a database of academic publications. The feature allows the user to pick a specific publication and ask questions about it. A backend engine evaluates the question, reviews the publication, and derives an answer.
To narrow down the scope of the feature for the sake of putting together a series of experiments, the publications are from the particular category of AI publications, and the question that the user asks is the following:
"Does this publication involve Reinforcement Learning?" 这个问题需要对每一篇出版物进行深入而富有洞察力的审查,因为在某些情况下,出版物讨论了一种新颖的算法,其中术语强化学习在出版物中的任何地方都没有明确提及,但该功能预计可以从描述中推断出来算法是否确实利用了强化学习的概念并将其标记为这样。
This question requires a deep and insightful review of each publication, as there are cases where a publication discusses a novel algorithm where the term reinforcement learning isn’t explicitly mentioned at any point in the publication, yet the feature is expected to infer from the description of the algorithm whether it indeed leverages the concepts of reinforcement learning and flag it as such.
Refer to the following notebook: Ch9_RAGLlamaIndex_Prompt_Compression.ipynb.
In this code, we run a set of experiments, each per the above feature description. Each experiment is in the form of a full, end-to-end RAG task. While we employed LangChain in the previous RAG examples, here, we introduce LlamaIndex. LlamaIndex is an open source Python library that employs a RAG framework (https://docs.llamaindex.ai/en/stable/index.html). LlamaIndex is similar to LangChain in that way.
The LLMLingua code stack that the folks at Microsoft put together is integrated with LlamaIndex.
Let’s review the code in detail.
Similar to the previous notebooks, here too, we set the initial settings with the following:
We take this opportunity to stress that some of the parameters in this evaluation were fixed for the sake of limiting its complexity and keeping it appropriate for educational purposes. When conducting such an evaluation in business or academic settings, there should be either qualitative or quantitative reasoning for the value chosen. Qualitative may be of the form "We shall fix the desired reduction to 999 tokens due to budget constraints," whereas quantitative may seek to not fix it but rather optimize it as a part of the other trade-offs. In our case, we fixed this particular parameter to a value that was found to allow for an impressive compression rate while maintaining a decent agreement rate between the two evaluated approaches. Another example was the number of experiments we chose, which was a trade-off between runtime, GPU memory allocation, and statistical power.
We need to gather the dataset of the publications, and we also filter it so as to be left with only the limited cohort of publications that are in the AI category.
Here, we set the ground for the two LLMs we will be employing.
The compression method, LLMLingua, employs Llama2 as the compressing LLM. It obtains the context retrieved by the LlamaIndex RAG pipeline, along with the user’s question, and it compresses and reduces the size of the context content.
OpenAI’s GPT is to be used as the downstream LLM for prompting, meaning it will obtain the question about reinforcement learning and the additional relevant context and return an answer.
Additionally, here, we define the user’s question. Note that we added instructions for OpenAI’s GPT on how to present the answer.
This is the core of the notebook. A for loop iterates over the various experiments. In each iteration, two scenarios are evaluated:
When this code cell is completed, we have a dictionary, record, that holds the relevant values for each iteration that will be used to aggregate and derive conclusions.
Here, we sum up the values of the experiments and deduce what impact the prompt compression has on the performance of the LLM, the processing time, and the cost of the API:
Note that cost reduction is negatively dependent on the agreement rate, as we expect an increase in cost savings to reduce the agreement rate.
This reduction is significant and, in some cases, may tilt the scale from a loss-making service to a profitable service.
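The kind of aggregation described here can be sketched with plain Python over illustrative experiment records. The field names and values below are made up for illustration, not the notebook's exact schema:

```python
# Each entry: did the two scenarios agree, plus the token counts of the
# original and compressed prompts. Values are hypothetical.
experiments = [
    {"agree": True,  "orig_tokens": 3000, "comp_tokens": 900},
    {"agree": True,  "orig_tokens": 2500, "comp_tokens": 800},
    {"agree": False, "orig_tokens": 3200, "comp_tokens": 950},
    {"agree": True,  "orig_tokens": 2800, "comp_tokens": 850},
]

agreement_rate = sum(e["agree"] for e in experiments) / len(experiments)
token_reduction = 1 - (sum(e["comp_tokens"] for e in experiments)
                       / sum(e["orig_tokens"] for e in experiments))

print(f"agreement rate: {agreement_rate:.0%}")
print(f"token (and cost) reduction: {token_reduction:.0%}")
```

These two summary numbers express exactly the tension described above: the larger the token reduction, the more pressure it puts on the agreement rate.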
Here are some notes to keep in mind regarding the meaning of a disagreement and additional trade-offs. Regarding the drop in agreement rate between the two approaches, while an agreement between the two approaches insinuates that they are both correct, a disagreement could go either way. It could be that in the second scenario, the compression distorted the context and, thus, made the model unable to properly classify it. However, the opposite may be true, as the compression may have reduced the irrelevant content and made the LLM focus on the relevant aspects of the content, thus making the scenario with the compressed context yield a correct answer.
Regarding additional trade-offs, the above metrics of LLM performance, processing time, and API cost don’t reveal additional considerations, such as the computational resources that the compression requires. The local compressing LLM, in our case, Llama2, requires local hosting and local GPUs. These are non-trivial resources that don’t exist on an ordinary laptop. Remember that the original approach, i.e., the first scenario, does not require those. An ordinary RAG approach can perform embeddings using either a smaller LM, such as one that is BERT-based, or even an API-based embedding. The prompted LLM, under our original assumption, is chosen to be remote and API-based, thus enabling the deployment environment to have minimal computation resources, like a common laptop would provide.
This evaluation demonstrates that the LLMLingua prompt compression method is very impactful and useful as a means of cost reduction.
In the next and last code demonstration of this chapter, we will continue to work with the results of this experiment, and we will do so by forming a team of experts, each played by an LLM, so as to enhance the process of deriving conclusions from the analysis.
This section deals with one of the most exciting recent methods in the world of LLMs, employing multiple LLMs simultaneously. In the context of this section, we seek to define multiple agents, each backed by an LLM and given a different designated role to play. Instead of the user working directly with the LLM, as we see in ChatGPT, here, the user sets up multiple LLMs and sets their role by defining a different system prompt for each of them.
Much like with people working together, here too, we see the advantages of employing several LLMs simultaneously.
Some advantages are the following:
In the following examples, we will see the latter in action.
That may present a major reduction in model size, as the combined architecture of several specialized LLMs may be smaller in size than the architecture of one generic LLM when assuming equal performance between the two scenarios.
The particular framework we employ in this section is called AutoGen, and it is made available by Microsoft (GitHub repo: https://github.com/microsoft/autogen/tree/main).
Figure 9.1 conveys the AutoGen framework. The following was obtained from the statement made in the GitHub repo:
AutoGen is a framework that enables the development of LLM applications using multiple agents that can converse with each other to solve tasks. AutoGen agents are customizable, conversable, and seamlessly allow human participation. They can operate in various modes that employ combinations of LLMs, human inputs, and tools.
Figure 9.1 – AutoGen functionality
On the left of Figure 9.1, we observe the designation of roles and capabilities to individual agents; on the right, we observe a few of the conversation structures that are available.
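Independent of AutoGen's actual API, the pattern of role-specialized agents passing a conversation among themselves can be sketched in plain Python, with stub functions standing in for the LLM calls. The agent names and replies below are hypothetical:

```python
from typing import Callable

class Agent:
    """A minimal conversing agent; `respond` stands in for an LLM call."""
    def __init__(self, name: str, system_prompt: str,
                 respond: Callable[[str], str]):
        self.name = name
        self.system_prompt = system_prompt
        self.respond = respond

# Stub "LLMs" with different designated roles.
analyst = Agent("analyst", "You analyze experiment results.",
                lambda msg: "Token use dropped sharply with a high agreement rate.")
critic = Agent("critic", "You challenge the analyst's conclusions.",
               lambda msg: "Verify the disagreement cases before recommending adoption.")

def round_robin(agents: list[Agent], task: str, turns: int = 2) -> list[str]:
    """Pass the conversation between agents, collecting the transcript."""
    transcript, message = [], task
    for i in range(turns):
        agent = agents[i % len(agents)]
        message = agent.respond(message)
        transcript.append(f"{agent.name}: {message}")
    return transcript

for line in round_robin([analyst, critic], "Review the compression results."):
    print(line)
```

AutoGen generalizes this loop with richer conversation structures, tool use, and optional human input, but the core idea is the same: each agent's system prompt fixes its role, and the transcript is the shared state.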
AutoGen’s key capabilities as presented in the code repo:
AutoGen is powered by collaborative research studies from Microsoft, Penn State University, and the University of Washington.
Next, we can dive into a practical example in the code.
Here, we will show how a team of multiple agents, each with a different designated role, could serve as a professional team. The use case we chose is a continuation of the previous code we ran. In the last code, we performed a complex evaluation of employing prompt compression, and when that code finished, we had two resulting items: the dict that holds the numeric measurements of the experiments, called record, and the verbal statements about the resulting agreement rate, the reduction in tokens and cost, and the change in processing time.
With that previous notebook, we intentionally stopped short. We didn’t visualize the reduction in tokens and cost, and we didn’t form an opinion as to whether we would advocate for employing the prompt reductions. However, in business or academic settings, one would be required to offer both. When you present your findings to stakeholders, decision-makers, or the research community, you are expected, when feasible, to visualize the statistical significance of the experiments. As a subject expert in NLP and ML, you are also expected to provide your recommendation on whether to adopt the experimented method or not.
We will take the results from that evaluation, and we will task a team of agents to do the work for us!
Refer to the following notebook: Ch9_Completing_a_Complex_Analysis_with_a_Team_of_LLM_Agents.ipynb. The notebook starts with the common aspects of installs, imports, and settings. You will notice that AutoGen has a particular format of settings in the form of a dictionary. They provide the details, as you can see in our notebook.
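For orientation, here is a minimal sketch of what such a settings dictionary typically looks like; the model name and API key below are placeholders of ours, not the notebook's actual values:

```python
# Minimal sketch of an AutoGen-style settings dictionary. The model name and
# API key are placeholders (our assumptions), not the notebook's values.
llm_config = {
    "config_list": [
        {
            "model": "gpt-4",     # placeholder model name
            "api_key": "sk-...",  # your provider API key goes here
        }
    ],
    "temperature": 0,  # deterministic outputs make runs easier to compare
}
```

AutoGen's documentation describes further options, such as listing several models to fall back between.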
Now, we move on to the interesting parts!
The record.pickle file holds a dict variable. It is the collection of numerical results from the previous evaluation notebook. Our wish is to visualize the distributions of the token counts for each of the experiments. There are token counts for the original prompts and token counts for the compressed prompts. There are also the ratios between the two for each experiment.
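To make this concrete, here is a small mock of such a record dict (the actual key names inside record.pickle are not spelled out here, so the ones below are our own illustrative choices) together with the pickle round-trip the notebook relies on:

```python
import pickle

# Hypothetical sketch: the exact keys inside record.pickle are assumptions of
# ours; we mock a small dict holding the three quantities described in the text.
record = {
    "original_tokens":   [120, 95, 140],   # token counts of the original prompts
    "compressed_tokens": [60, 50, 70],     # token counts after prompt compression
}
record["ratios"] = [c / o for c, o in zip(record["compressed_tokens"],
                                          record["original_tokens"])]

# Round-trip through a pickle file, as the notebook does with record.pickle
with open("record.pickle", "wb") as f:
    pickle.dump(record, f)
with open("record.pickle", "rb") as f:
    loaded = pickle.load(f)
```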
In this section, we'll form a team to put code together that would visualize the distributions of each of the three.
First, we define the task to be fulfilled by the team. We tell the team where the file is saved and the context and the nature of the values in the dict, thus giving the team the understanding they need to ideate a solution to the task. Then, we describe the task of creating a plot and visualizing the distributions. All those details are in the one string that describes the task. Note that in an Agile Scrum work setting, this task string is similar to the purpose of the story.
Now that we have formed a comprehensive description, it should be clear what is expected. For instance, we ask for the figures and axes to be labeled, but we don’t explicitly state what labels are expected. The agents will understand on their own, just as we would have understood this on our own, as the labels are inferred from the task and the data field names.
For this task, we would need three team members: a programmer to write the code, a QA engineer to run the code and provide feedback, and a team lead to verify when the task is complete.
For each of the roles, we articulate a system prompt. This system prompt, as we learned in Chapter 8, has a significant impact on the LLM’s function. Notice that we also provide the QA engineer and the team lead with the ability to run code on their own. In this way, they will be able to verify the programmer’s code and provide objective feedback. If we told the same agent to write the code and to confirm that it is correct, we might find that, in practice, it would generate a first draft, wouldn’t bother to run and verify it, and would conclude the task without having verified it.
Here, we define the conversation to be a multi-agent conversation; this is one of the features of AutoGen. This is slightly different from the case where you define a series of conversations where each conversation involves just two agents. The group conversation involves more agents.
When defining a group conversation, we also define a manager for the conversation.
The team lead presents the manager with the task we defined. The manager then delegates the work to the programmer and the QA engineer.
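As a sketch of how such a team might be wired together, the class and parameter names below follow pyautogen's documented interface at the time of writing; treat them as assumptions and verify against the version you have installed. The system messages and settings are illustrative, not the notebook's exact text, and the chat itself is commented out because it requires a valid API key:

```python
# Sketch of wiring the three-agent group conversation with AutoGen.
try:
    import autogen

    llm_config = {"config_list": [{"model": "gpt-4", "api_key": "sk-..."}]}

    lead = autogen.UserProxyAgent(
        name="lead",
        human_input_mode="NEVER",  # set to "ALWAYS" to intervene yourself
        code_execution_config={"work_dir": "scratch", "use_docker": False},
    )
    programmer = autogen.AssistantAgent(
        name="programmer",
        system_message="You write complete, runnable Python code for the task.",
        llm_config=llm_config,
    )
    qa_engineer = autogen.AssistantAgent(
        name="qa_engineer",
        system_message="You execute the code you receive and report the result.",
        llm_config=llm_config,
    )

    groupchat = autogen.GroupChat(
        agents=[lead, programmer, qa_engineer], messages=[], max_round=12
    )
    manager = autogen.GroupChatManager(groupchat=groupchat, llm_config=llm_config)
    # lead.initiate_chat(manager, message=task)  # 'task' is the string defined earlier
    wired = True
except Exception:  # pyautogen missing, or a different AutoGen version/API
    wired = False
```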
Here are the highlights of that automated conversation as it appears on the screen:
lead (to manager_0): Refer to the Python dict that is in this [...]
programmer (to manager_0):
```python
import pandas as pd
import matplotlib.pyplot as plt
# Load the record dict from URL
import requests
import pickle
[...]
```
qa_engineer (to manager_0): exitcode: 0 (execution succeeded)
Code output: Figure(640x480)
programmer (to manager_0): TERMINATE
As can be seen, the conversation had four interactions, each between two agents. Each interaction starts by telling the user which agent is talking to which other agent; these parts are in bold letters in the preceding printout.
In the second interaction, the programmer provided a complete Python script. We pasted only the first four commands to keep it short, but you can observe the full script in the notebook. The QA engineer ran the script and reported that it ran well. If it hadn’t run well, it would have returned an exitcode: 1 and would have provided the programmer with the error specification for the programmer to fix the code; the conversation would have continued until a solution was found, or, if not, the team would report failure and conclude the conversation.
This task provided us with the code to create the visual we wanted. Note that we didn’t ask the agents to run the code and provide us with the visual; we asked for the code itself. One could, if desired, configure the LLMs to run the code and provide us with the resulting image. See AutoGen’s repo for the various examples and capabilities.
In the next code cell, we pasted the code that the team created. The code runs well and visualizes the three distributions exactly as we asked the team (see Figure 9.2):
Figure 9.2 – Visualizing the value that prompt compression provides
The top visualization displays the distributions of the token count for the original prompts (blue/light shade) and the compressed prompts (orange/dark shade), and the bottom part of the figure shows the distribution of the ratio between each pair of prompts. Figure 9.2 shows just how effective the reduction rate is, as this ratio translates to a reduction in API cost.
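The team's full script lives in the notebook, but as a sketch of how such a two-panel figure could be drawn, the following uses mock token counts of our own (the numbers are assumptions, not the notebook's measurements):

```python
import random

import matplotlib
matplotlib.use("Agg")  # render off-screen; no display needed
import matplotlib.pyplot as plt

# Mock token counts standing in for the notebook's real measurements.
random.seed(0)
original = [random.randint(80, 160) for _ in range(100)]
compressed = [int(o * random.uniform(0.4, 0.6)) for o in original]
ratios = [c / o for c, o in zip(compressed, original)]

fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(6.4, 4.8))
ax1.hist(original, bins=20, alpha=0.6, label="original prompts")
ax1.hist(compressed, bins=20, alpha=0.6, label="compressed prompts")
ax1.set_xlabel("token count")
ax1.set_ylabel("frequency")
ax1.legend()
ax2.hist(ratios, bins=20)
ax2.set_xlabel("compressed / original token ratio")
ax2.set_ylabel("frequency")
fig.tight_layout()
fig.savefig("token_distributions.png")
```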
This concludes the visualization of the significance of the experiments.
Note that all three agents are driven by LLMs, thus making this entire task automatically performed without human intervention. One could change the lead’s configuration to represent a human user, meaning you. If you did that, then you would be able to intervene and demand certain verifications from the QA engineer or certain additional features in the code from the programmer.
This could be particularly useful if you wanted to run the code yourself in your environment instead of letting the QA engineer agent run it in its own environment; the two environments are different. One advantage of doing this arises when the code needs to load a data file that exists only on your machine. If you told the agent to write code that loads this file, then when the QA engineer agent ran it, it would tell you the code failed, since that data file doesn’t exist in its environment. In this case, you may elect to be the one who iterates with the programmer, running the code during the iterations and providing feedback.
Another case where you would want to be the one running the code and providing feedback is when the QA engineer encounters an error or a bug in the programmer’s code, but the two agents aren’t able to figure out the solution. In that case, you would want to intervene and provide your insight. For instance, in a case where a for loop iterates over a dict’s keys instead of its values, you may intervene and enter: The code runs, but the for loop is iterating over the dict’s keys. It should iterate over the values stored under the key ‘key1’.
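As a minimal illustration of that particular bug and its fix (the dict contents are invented for the example):

```python
# Minimal illustration of the bug described above: iterating a dict iterates
# its keys, not the values stored under a particular key.
record = {"key1": [0.5, 0.6, 0.4], "key2": [10, 12, 9]}

buggy = [item for item in record]          # iterates the keys
fixed = [item for item in record["key1"]]  # iterates the values under 'key1'
```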
We can now move on to the second part of concluding the evaluation.
As with every complex evaluation where we perform experiments to target the impact of a particular feature, we would now like to derive a qualitative summary of the results and suggest a conclusion for our audience, whether it is the decision-makers in the company or the research community in academia.
What is unique about this part is that the act of deriving a conclusion has traditionally never been left to a mathematical or algorithmic model. We humans govern the various evaluations, and although we may seek to automate as much as possible of what feeds into the final conclusion, we are the entity that forms the final impression and conclusion.
Here, we attempt to automate that final part. We will assign a team of expert agents to provide an educated summary of the results that the evaluation notebook printed out. We'll then push the team to provide us with a recommendation as to whether we should implement the new feature of prompt compression or not. We provide the team with the actual results of the evaluation notebook, but in order to examine its reliability, we then task it again, this time providing it with mocked results that are much poorer, hoping that the team will apply judgment and provide a different recommendation. All of this is done without any human intervention.
As we did before, we start by defining the task for our team to fulfill.
Our aim is to provide the team with the printout of the evaluation notebook from the previous section. That printout describes, in words, the change in agreement rate, the impact on the number of prompt tokens, and the processing runtime, all due to employing the LLMLingua prompt compression method.
We then copy that from the previous notebook and paste it as a text string.
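As an illustration, the pasted printout and the task string might be assembled like this; the numbers below are ours, not the notebook's actual output:

```python
# Hypothetical stand-in for the pasted results printout; the real numbers live
# in the previous notebook, so the values below are illustrative only.
true_results = (
    "Agreement rate between original and compressed prompts: 95%.\n"
    "Average token count reduced from 120 to 60 (a 50% reduction).\n"
    "Average processing time changed from 1.2s to 0.9s per prompt."
)

task_description = (
    "Refer to the results printed below. Summarize them and recommend "
    "whether or not to adopt prompt compression.\n\n" + true_results
)
```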
Note that we have also created another text string of results. These are mocked results that are much worse than the true ones: the agreement rate is very low, and the reduction in token count due to compression is much less significant.
As we did in the visualization case, we then create the instructions for the team; we paste the results into the task description for the team to refer to when deriving its conclusion. We have two task descriptions, as we will have two separate runs, one with the true results and one with the mocked bad results.
We will now allocate the roles.
For this task, we would need three team members: a principal engineer who is an experienced technical person, a technical writer who writes the conclusion as per the principal engineer’s feedback, and a team lead to verify when the task is complete, which was defined in the previous task.
Here, we define the group conversation, just like we did in the visualization part. This time, we have a new group conversation manager, as the group consists of different agents.
The team lead presents the manager with the task we defined. The manager then delegates the work to the writer and the principal engineer.
Here are the highlights of that automated conversation as it appears on the screen:
lead (to manager_1): Refer to the results printed below. These are the results that stem from [...]
writer (to manager_1): The experiments on prompt compression using LLMLingua have produced the following results:
- Classification Performance:
- Agreement rate of [...]
principal_engineer (to manager_1): [...]
The agents have a few iterations between them and come to an agreement regarding the summary and the conclusion.
They provide a summary of the numeric results and seal it with the following recommendation:
It is imperative to carefully consider the trade-offs presented by prompt compression, as while it may lead to resource savings, there might be implications on processing efficiency. The decision to adopt prompt compression should be made with a thorough understanding of these trade-offs.
The team agrees on a cautious approach to presenting the various trade-offs and avoids making a decision in spite of being tasked to do so.
One would wonder, could a definite decision to adopt or not to adopt the method be made here?
Now, we will ask the team to perform the same action, this time providing it with the mocked results that make the compression method seem much less effective, with a sharp drop in agreement with the classifications of the noncompressed method.
The team has a conversation, and the final agreement summary is sealed with the following statement:
Overall, the results indicate that while prompt compression may lead to cost savings and resource reduction, it comes at the expense of decreased classification performance and significantly increased processing times. **Recommendation:** Prompt compression using LLMLingua is **not recommended** as it can negatively impact classification performance and significantly increase processing times, outweighing the potential cost savings.
Here, the team found it much easier to draw a definite conclusion. It did so without any human intervention and solely based on the numerical results it was given.
This emerging method of simultaneously employing several LLMs is gaining interest and traction in the world of AI. The code experiments that we presented in this section demonstrated, without a doubt, that AutoGen’s group conversation can provide tangible and actionable value in a professional setting. Although setting up these code experiments required a series of trials and errors to properly set the agent roles and properly describe the tasks, the framework is moving in a direction where less human intervention is required. What seems to remain a monumental component is the human oversight, feedback, and evaluation of the artifacts that those agent teams produce. We would like to stress to the reader that of the various applications and innovations that we share in this book, we have marked the multi-agent framework as the one that is most likely to grow and to also become the most popular. This is based on the overwhelming expectation across industries that AI will automate and demonstrate human-like expertise, while innovations such as AutoGen and, later, AutoDev, both by Microsoft, exemplify its growing feasibility and competency.
Throughout this pivotal chapter, we have embarked on an in-depth exploration of the most recent and groundbreaking applications of LLMs, presented through comprehensive Python code examples. We began by unlocking advanced functionalities by using the RAG framework and LangChain, enhancing LLM performance for domain-specific tasks. The journey continued with advanced methods in chains for sophisticated formatting and processing, followed by the automation of information retrieval from diverse web sources. We also tackled the optimization of prompt engineering through prompt compression techniques, significantly reducing API costs. Finally, we ventured into the collaborative potential of LLMs by forming a team of models that work in concert to solve complex problems.
By mastering these topics, you have now acquired a robust set of skills, enabling you to harness the power of LLMs for a variety of applications. These newfound abilities not only prepare you to tackle current challenges in NLP but also equip you with the insights to innovate and push the boundaries of what’s possible in the field. The practical knowledge gained from this chapter will empower you to apply advanced LLM techniques to real-world issues, opening up new opportunities for efficiency, creativity, and problem-solving.
As we turn the page, the next chapter will take us into the realm of emerging trends in AI and LLM technology. We will delve into the latest algorithmic developments, assess their impact on various business sectors, and consider the future landscape of AI. This forthcoming discussion promises to provide you with a comprehensive understanding of where the field is headed and how you can stay at the forefront of technological innovation.
Natural language processing (NLP) and large language models (LLMs) stand at the intersection of linguistics and artificial intelligence, serving as milestones in our understanding of human-computer interactions. Their story begins with basic rule-based systems, which, while innovative for their time, often stumbled due to the complex nuances and immensity of human language. The limitations of these systems highlighted the need for a shift, paving the way for the machine learning (ML) era, where data and pattern recognition prescribe the design and the models.
In this chapter, we will review key trends that have been emerging in NLP and LLMs, some of which are broad enough to capture the direction of AI as a whole. We will discuss those trends from a qualitative perspective as we aim to highlight their purpose, value, and impact. In the next sections, we’ll share our thoughts on what the future might look like. We hope to spark your curiosity and inspire you to explore these emerging paths with us.
Let’s go through the main topics covered in the chapter:
Let’s dive into the many trends we are seeing, starting with the technical ones.
In this section, we cover what we identify as key trends in the field of NLP and LLMs.
We will start with the technical trends, and later, we will touch on the softer cultural trends.
As technology has advanced, especially in computing, many areas in tech have thrived, particularly NLP and LLMs. It’s not just about faster calculations and bigger parameter space; it’s about new possibilities and reshaping our digital world. In this section, we’ll explore how this growth in computing has been foundational for NLP and LLMs today, focusing on their purpose, worth, and influence.
In the initial days of AI and ML, the models were rudimentary—not due to a lack of imagination or intent, but because of restrictive computational boundaries. Tasks that we now consider basic, such as simple pattern recognitions, were significant undertakings, as they demanded great algorithmic sophistication to allow for low complexity. In computer science classes, we were taught that an algorithm with complexity beyond linear has poor sustainability and impractical scalability.
As computational power grew, so did the ambition of researchers. No longer were they confined to toy problems or theoretical settings. The computational evolution meant they could now design and test models of considerable complexity and depth, which we now view as a prerequisite for advanced NLP and LLMs.
The emergence of parallel processing and the development of graphics processing units (GPUs) marked a fundamental shift. Designed to handle multiple operations simultaneously, these innovations were seemingly tailor-made for the demands of NLP, allowing for the training of extensive computation tasks such as neural networks and facilitating real-time processing.
Computation power didn’t just improve what was possible; it transformed what was practical. Training large models became economically feasible, ensuring that research institutions and companies could experiment, iterate, and refine their models without prohibitive costs.
The digital age has introduced an overflow of data. Efficiently processing, parsing, and gleaning insights from this ocean of information became viable primarily due to exponential growth in computation power. This has been instrumental in LLMs’ ability to self-train on extensive datasets, extracting nuanced linguistic patterns and treating them as signals for downstream tasks such as prediction and assistance.
Today’s users are becoming accustomed to a growing processing speed and they demand instant interaction. Whether it’s a digital assistant offering suggestions or an AI-driven customer service platform, real-time responses are a standard. Enhanced computational capacities have ensured that complex NLP tasks, which would have taken minutes, if not hours, in the past, are now completed within seconds on end devices.
The improvements in computational power have seen AI-driven interfaces become the norm. From chatbots on websites to voice-activated home assistants, NLP and LLMs, supercharged by advanced processing capabilities, have become a part of daily life.
The domains of art, literature, and entertainment have seen AI’s ingress, with tools such as AI-driven content creators and music generators becoming possible due to the close relationship between NLP/LLMs and computational strength.
With the computational means to process diverse linguistic data, NLP models now offer multilingual support, breaking down language barriers and fostering global digital inclusivity. During 2023, we witnessed a major milestone when Meta released SeamlessM4T, a multilingual LLM that is a single model performing speech-to-text, speech-to-speech, text-to-speech, and text-to-text translations for up to 100 languages; you can read more about this here: https://about.fb.com/news/2023/08/seamlessm4t-ai-translation-model/.
To conclude, this story of computational power and its relationship with NLP and LLMs is one of mutual growth and evolution. It’s a tale that underscores the bond between hardware advancements and software innovations. As we look onward, with quantum computing and neuromorphic chips suggesting the next frontier of computational leaps, one can only imagine the further revolutions in store for NLP and LLMs. The purpose, value, and impact of computational progress that we are witnessing are a testament to its role as the cornerstone of the AI-driven linguistic revolution.
Now, let’s see where things are headed.
We identify several advancements that will take place and will push the computational power that AI, and NLP in particular, can leverage.
Moore’s law has traditionally held that the number of transistors on a microchip doubles approximately every two years. Although there’s speculation about its sustainability in the traditional sense, it provides a useful guide for estimating the growth in computational capability. Advancements in chip architecture, such as 3D stacking and innovative transistor designs, might help sustain or even accelerate this growth.
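As a quick back-of-the-envelope calculation of what that doubling rhythm implies:

```python
# Moore's law as a growth factor: transistor count doubles every ~2 years.
def moores_law_growth(years: float, doubling_period: float = 2.0) -> float:
    """Multiplicative growth in transistor count after `years` years."""
    return 2 ** (years / doubling_period)

decade_factor = moores_law_growth(10)  # 2**5 = 32x over ten years
```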
The need for real-time NLP applications, from translation services to voice assistants, will continue to drive demand for faster computational speeds. We are witnessing a new trend of AI-dedicated hardware. Google released the Tensor Processing Unit in 2015 (https://spectrum.ieee.org/google-details-tensor-chip-powers), and since then, we have seen several more such dedicated pieces of hardware, whether by big players, such as Meta and Nvidia, or by small emerging startups.
As AI and NLP become more abundant, there’s a significant incentive for tech giants and startups alike to invest in more efficient, scalable, and cost-effective computational infrastructure.
The transition to cloud computing has already made vast computational resources accessible to even small startups. This trend is likely to continue, with costs per computation expected to decrease, making NLP applications more accessible and affordable.
Quantum computing represents a paradigm shift in the way we understand and harness computational power. Quantum bits, or qubits, can represent both 0s and 1s simultaneously through the phenomenon of superposition, potentially offering exponential speedups for specific problems.
Although quantum computing is in its growing stages, its potential implications for NLP are profound. Training complex models, which currently takes days or weeks, could be reduced to hours or even minutes.
Google has established itself as a significant spearheader in the world of quantum computing (The following quote is taken from here: https://quantumai.google/learn/map):
Beginning with around 100 physical qubits, we can study different approaches to building logical qubits. A logical qubit allows us to store quantum data, without errors, long enough that we can use them for complex calculations. After that, we’ll reach quantum computing’s transistor moment: the moment that we demonstrate that the technology is ready to be scaled and commercialized.
Google drafted a roadmap of milestones that laid out the future forecasts of key achievements. See Figure 10.1. It should be noted that Google has been adhering to it, which, for such an ambitious research field, is astonishing:
Figure 10.1 – Key milestones for building an error-corrected quantum computer
Cryptography, a key component in secure data transmission that is essential for cloud-based NLP services, will also undergo massive changes, given quantum computing’s potential to break several existing encryption methods. Thus, the rise of quantum-safe cryptographic methods will be vital.
As the demand for computational power grows, so does the energy consumption of data centers. There will be a dual drive towards more energy-efficient computation and sustainable energy sources for powering these computational efforts.
In the context of NLP, this might mean more efficient model architectures that require less energy to train and run, alongside hardware innovations that maximize operations per watt.
We’ve already seen the rise of specialized tensor processing units (TPUs) for DL. Going forward, there might be hardware specifically optimized for NLP tasks, ensuring faster and more efficient language model operations.
Neuromorphic computing, which attempts to mimic the human brain’s architecture, may offer unique advantages for tasks such as NLP, which require a blend of logic and intuition. Davies et al. review some of the key opportunities in their publication “Advancing Neuromorphic Computing With Loihi: A Survey of Results and Outlook.”
随着边缘计算的进步由于日常设备中存在大量强大的处理器,高端 NLP 任务可能并不总是需要连接到集中式数据中心。先进的 NLP 功能有可能成为智能手机、智能家居设备甚至智能手表的标准配置。您将在您的个人设备上获得LLMs课程,该课程在本地运行并以与计算器相同的方式立即响应。
With advancements in edge computing and the abundance of powerful processors in everyday devices, high-end NLP tasks might not always require a connection to a centralized data center. Potentially, advanced NLP capabilities could become standard in smartphones, smart home devices, and even smartwatches. You will have an LLM available on your personal device, running locally and responding immediately in the same way as your calculator.
云平台在计算资源方面提供了前所未有的灵活性,使得训练更大规模、更大规模的数据变得更加容易。更复杂的NLP 模型。
Cloud platforms offer unprecedented flexibility in terms of computational resources, making it easier to train larger and more sophisticated NLP models.
AWS 的 SageMaker、微软 Azure 机器学习工作室和谷歌 Vertex AI 等平台培育了协作精神,为研究人员和开发人员提供了无缝共享模型、数据集和工具的工具。
Platforms such as AWS’s SageMaker, Microsoft’s Azure Machine Learning Studio, and Google’s Vertex AI have fostered a spirit of collaboration, giving researchers and developers tools to share models, datasets, and tools seamlessly.
本地、边缘和云计算的结合确保 NLP 任务得到高效处理,平衡延迟和计算能力。
The combination of local, edge, and cloud computation ensures that NLP tasks are handled efficiently, balancing both latency and computational power.
云平台不断发展,使高端计算能力更容易获得,其定价模型反映了实际使用情况,并以较低的成本提供临时的高性能计算访问。
Cloud platforms are evolving to make high-end computational power more accessible, with pricing models that reflect actual usage and offer temporary high-powered computational access at reduced costs.
总结我们对计算能力未来的看法,因为它与 NLP 相关,它显然处于上升轨道。尽管挑战仍然存在,特别是在能源消耗领域和传统芯片扩展的潜在障碍方面,但量子计算等创新有望打开大门,这些能力肯定会在专门的书籍中占有一席之地。
To conclude our view on the future of computational power, as it relates to NLP, it is clearly on an upward trajectory. While challenges remain, especially in the realms of energy consumption and the potential roadblocks in traditional chip scaling, innovations such as quantum computing promise to open doors to capabilities that will definitely get their own share of dedicated books.
计算的未来作为 NLP 运行引擎的力量看起来很光明,所以让我们讨论另一个工具组件:数据。
The future of computation power, which is the engine that NLP runs on, is looking bright, so let’s discuss another instrumental component: data.
大数据时代与随后NLP和LLM的兴起有着深刻的联系。在讨论 NLP 和 LLM 向当今强大发展的转变时,不能不提及可用的大量数据集。我们来探讨一下这种关系。
The era of big data and the subsequent rise of NLP and LLMs are deeply linked. The transformation of NLP and LLMs into today’s powerful developments cannot be discussed without mentioning the vast datasets that became available. Let’s explore this relationship.
从本质上讲,大型数据集的出现为训练日益复杂的数据提供了所需的原材料。楷模。通常,数据集越大,模型可以学习的信息就越全面和多样化。
At its core, the emergence of large datasets has provided the raw material required to train increasingly sophisticated models. Typically, the larger the dataset, the more comprehensive and diverse the information the model can learn from.
大型数据集不仅可以作为训练场,还可以提供评估模型性能的基准。这导致了标准化的测量,为研究人员提供了明确的目标,并允许在模型之间进行同类比较。有一系列常见的基准可用于评估LLMs。Google 创建了一个著名且非常全面的基准测试,即超越模仿游戏基准测试(BIG-bench)。它是一个基准,旨在评估LLMs的反应并推断他们未来的能力。它封装了 200 多个任务,例如阅读理解、总结、逻辑推理,甚至社会推理。
Large datasets not only serve as training grounds but also provide benchmarks for evaluating model performance. This has led to standardized measures, giving researchers clear targets and allowing for apples-to-apples comparisons between models. There is a collection of benchmarks that are common and can be used for evaluating LLMs. One famous and very comprehensive benchmark was created by Google, the Beyond the Imitation Game benchmark (BIG-bench). It is a benchmark designed to evaluate responses from LLMs and infer their future capabilities. It encapsulates over 200 tasks, such as reading comprehension, summarization, logical reasoning, and even social reasoning.
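As a toy illustration of how benchmark-style evaluation works, the following sketch scores model outputs against reference answers with a normalized exact-match metric. The evaluation data and the scoring rule here are illustrative stand-ins, not part of BIG-bench itself:

```python
def normalize(text: str) -> str:
    """Lowercase and strip punctuation so trivial formatting
    differences don't count as errors."""
    kept = (ch for ch in text.lower().strip() if ch.isalnum() or ch.isspace())
    return "".join(kept).strip()

def exact_match_score(predictions, references) -> float:
    """Fraction of predictions that match their reference after normalization."""
    hits = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return hits / len(references)

# Toy evaluation set (illustrative).
references = ["Paris", "4", "William Shakespeare"]
predictions = ["paris", "4", "Charles Dickens"]  # pretend model outputs

print(exact_match_score(predictions, references))  # 2 of 3 correct
```

Real benchmarks layer many task types and metrics on top of this idea, but the core loop of comparing model outputs to curated references is the same.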
Large datasets covering specific domains, such as healthcare or legal texts, pave the way for specialized models that can understand and operate within niche areas with high precision. For example, BERT was developed by Google and was later made available freely by Hugging Face. BERT's design employs transfer learning; thus, it lends itself very well to customizing and creating new versions of the model that are dedicated to a particular domain. Some of the most successful versions are BERT-base-japanese, which was pre-trained on Japanese data; BERTweet, which was pre-trained on English tweets; and FinBERT, which was pre-trained on financial data.
With more data, models can capture more nuances and subtleties of human language. This wealth of information results in models that generalize better to a variety of tasks.
The availability of vast and varied datasets ensures that models are trained on a diverse range of languages, dialects, and cultural contexts. This has pushed NLP towards being more inclusive, recognizing and responding to a wider audience.
Large datasets negate the need for extensive manual labeling to some extent. Unsupervised and self-supervised learning models, which were covered earlier in the book, capitalize on this abundance, saving both time and money.
With open access to large datasets, many barriers to entry in the NLP research field have been lowered. This has led to a democratization of NLP, with more individuals and organizations being able to innovate.
LLMs such as GPT-3 and BERT owe their proficiency to the extensive data they were trained on. These models, considered state-of-the-art, have set new benchmarks in various NLP tasks, all thanks to the rich datasets they were trained on.
As NLP was mainly a research field for so many years, some legal aspects that apply to the commercial domain weren't applicable. However, as the vast usage and commercialization of these models have emerged, the large datasets that they reflect carry dire concerns. These datasets, which are often scraped from the web, have brought up ethical questions around privacy, data ownership, and potential biases. This has caused regulators to work on guidelines regarding how to ethically source and use data. For example, as of the writing of this book, we have noticed several different actions by different nations. Japan has been quick to adopt a very liberal policy, allowing models to be trained on data available online, while the European Union has been demonstrating a more restrictive approach. The USA's official guidelines seem to avoid addressing the copyright debate.
We can now articulate some future projections for data and its role in developing LLMs.
In the future, we will see how data continues to grow while its various aspects and challenges are addressed. Here are the pivotal points.
As LLMs are proving themselves capable and favorable, an emphasis is being put on making them proficient. One of the several ways that we can enhance an LLM to become proficient is by providing it with a dataset that captures the particular domain it is meant to serve, and then utilizing the LLM as an expert in that particular domain. In the future, we anticipate the cultivation of more niche, domain-specific datasets. Whether it's healthcare, law, finance, or any specialized field, the emphasis will be on data richness and specificity, enabling models to achieve unparalleled domain expertise. Since the emergence and growing popularity of LLMs, we have seen several such business cases of customizing LLMs to serve a particular business domain, with healthcare and finance gaining a lot of attention.
Conversely, as different domains overlap, integrated datasets emerge. These are datasets combining expertise from multiple fields. For instance, a dataset may intertwine law and AI ethics in an attempt to suggest novel insights promoting regulations around AI. Another example is linking computer code and stock trading for the sake of forming an algorithmic trading scheme.
As technology expands its reach, datasets will increasingly encompass lesser-known languages and regional dialects. This will allow NLP to cater to a broader global audience, making digital communication more inclusive. Meta's SeamlessM4T, which we discussed earlier in this chapter, is a terrific example of being able to converse across languages via an LLM.
Beyond the language itself, there is also a cultural aspect to language, such as jargon or the mere choice of words. Capturing cultural nuances and context will become paramount in future text generation. This will lead to more culturally conscious and context-aware models.
In recognizing the implicit biases present in our digital content, there will be a surge in tools and methodologies to audit datasets for biases. The community will strive for datasets that are both large and fair. Instead of blindly scraping the web, more effort will go into curating data, ensuring it's representative and free from evident prejudices. This might include actively seeking underrepresented voices or filtering out potentially harmful biases.
With growing concerns about data privacy, especially in the European Union with GDPR and in California with CCPA, we can expect stricter guidelines on how datasets can be collected and utilized.
Beyond privacy, there will be a push for more ethical ways to gather data. This means ensuring data are collected without exploitation, with proper consent, and with respect for the rights of individuals and communities.
In the spirit of reproducible research, there might be a drive towards making datasets, especially those used for benchmarking and major models, more transparent and open. This would have to be balanced, of course, with privacy concerns.
In a digital landscape where creating genuinely new and unique data is an extraordinary task, augmented datasets present an alternative solution. By artificially expanding and modifying existing datasets, augmentation can swiftly cater to the growing hunger for diverse data without the exhaustive process of fresh data collection. Augmented datasets help to tackle these four challenges with datasets:
Nonetheless, while augmented datasets offer innovative solutions to many data-related challenges, they aren't without shortcomings. In principle, over-reliance on augmentation can lead to models that are adept at recognizing artificial patterns but fail with real-world variability. There's also the risk of inadvertently amplifying biases if the original datasets had unaccounted skews. Furthermore, not all augmentation techniques are universally applicable; what works for one dataset might distort another. Lastly, there's the ethical debate around creating synthetic data, especially in sensitive fields, where the distinction between real and augmented could blur essential truths.
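As a rough sketch of what text augmentation can look like in practice, the snippet below generates variants of existing sentences via synonym substitution and word dropout. The tiny synonym table is purely illustrative; real pipelines draw on resources such as WordNet, and, as noted above, any skew in the seed data carries over into the variants:

```python
import random

# Illustrative synonym table; a real pipeline would use a lexical resource.
SYNONYMS = {"quick": ["fast", "rapid"], "happy": ["glad", "pleased"]}

def augment(sentence: str, rng: random.Random) -> str:
    words = sentence.split()
    out = []
    for w in words:
        # Synonym replacement: swap a word for a listed synonym half the time.
        if w in SYNONYMS and rng.random() < 0.5:
            out.append(rng.choice(SYNONYMS[w]))
        # Word dropout: occasionally drop a word to vary sentence structure.
        elif rng.random() < 0.1 and len(words) > 3:
            continue
        else:
            out.append(w)
    return " ".join(out)

rng = random.Random(0)  # seeded so runs are reproducible
print(augment("the quick brown fox is happy", rng))
```

Each call yields a slightly different variant, multiplying a small labeled set without fresh collection, which is exactly the appeal (and the risk) described above.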
To conclude our coverage of data in the context of NLP and AI, we observe how the availability of large datasets has revolutionized the domain of NLP and the development of LLMs. They've provided the foundation upon which the magnificent establishment of modern NLP stands, shaping its purpose, magnifying its value, and leaving a lasting impact on research, applications, and society at large.
On the horizon, as large datasets continue to shape the world of NLP, we are looking at a future that's not just data-rich but also ethically conscious, domain-specific, and globally inclusive. These trends, sourced from the collective wisdom of current web articles and publications, paint a promising picture of NLP's data-driven journey ahead.
Now that we have discussed the computational power that drives the creation of the algorithms, and the data, which guides the LLMs' intelligence, we can consider the LLMs themselves.
The rise and development of LLMs stand as a testament to our relentless pursuit of more advanced algorithms. These giant computational linguistics models have come a long way from their initial incarnations, growing not only in size but also in capabilities. As we delve into the purpose, value, and impact of these formidable tools, it becomes clear that their evolution is closely intertwined with our aspiration to harness the true potential of machine-driven communication and cognition.
The rationale behind the development of LLMs revolves around the quest to bridge the gap between human and machine communication, where human language is to be fed into a machine for downstream processing. As the digital age began, the need for fluid, context-aware, and intelligent systems that could grasp human language with nuanced understanding became apparent. As was covered extensively in prior chapters, DL represents the foundation of LLMs. As computational capabilities expanded, DL models grew in depth and complexity, leading to enhanced performance in various tasks, especially NLP.
The traditional training of DL models relies on supervised learning that requires labeled data, which, in turn, is both resource-intensive and limiting. The emergence of self-supervised learning and methods such as reinforcement learning from human feedback (RLHF) broadened horizons. These methods not only minimized the need for explicit labeling but also opened doors for models to learn more organically, mirroring human learning processes.
Early NLP models could answer questions or perform tasks with a narrow focus. The evolution of LLMs brought a paradigm shift where models began exhibiting reasoning abilities, following a chain of thought, and producing coherent, longer responses. This was a significant step towards replicating human-like conversation. The generic approach of earlier models had its limitations. As the technology matured, the ability to tailor LLMs to specific tasks emerged. Techniques such as setting up retrieval datasets or fine-tuning pre-trained models allowed businesses and researchers to mold generic LLMs into specialized tools, enhancing both accuracy and utility.
LLMs, with their evolution, brought forth unprecedented value in multiple domains. They have become more accurate, efficient, adaptable, and customizable.
Larger models demonstrated an intrinsic ability to grasp context, reducing errors in interpretation and output. This accuracy translated to efficiency in various applications such as chatbots and content creation. They adapt by leveraging brilliant techniques such as RLHF, which enables them to learn from interactions and feedback, making them more resilient and dynamic over time. By being customizable, LLMs can cater to niche industries and tasks, making them invaluable assets across diverse sectors.
Another value that we can see growing is the ability to break language barriers, as the models understand and generate multiple languages, tapping into the global aspiration of universal communication.
The rise and evolution of LLMs have left a permanent mark on the tech landscape and on human interaction with machines. From healthcare and finance to entertainment and education, LLMs are revolutionizing operations, customer interactions, and data analyses. Interestingly, as these models become more complex, their use becomes less challenging. Tech acumen is becoming a much lower requirement: with more intuitive and natural language interfaces, a broader demographic, irrespective of their technical know-how, can now harness the power of advanced computational tools.
These elements of impact are part of the onset of cohesive digital ecosystems. As LLMs integrate across platforms and services, we're witnessing the creation of more organized and synchronized digital ecosystems that offer seamless user experiences.
It is exciting to think about where things are headed next with LLMs.
The rapid evolution of LLMs promises a future teeming with innovations. Drawing from current research trends, online publications, and expert predictions, we can forecast several directions in which LLM design might be headed.
As we've seen, self-supervised learning and RLHF have changed the game for LLMs. The next frontier could involve combining various learning paradigms or introducing newer ones. With the advancement of DL techniques, we might see more hybrid models that integrate the best attributes of different architectures to improve performance, generalization, and efficiency.
An example of employing several LLMs simultaneously was articulated by Palantir's CTO, Shyam Sankar, as he described their K-LLMs approach. He likened LLMs to experts and asked why a single expert should be used to answer a question when a committee could be put together to all pitch in on that question. He suggested using an ensemble of different LLMs, each perhaps with complementary strengths, so as to synthesize an answer that is more carefully considered. It should be stressed that in this idea, each LLM is assigned the same task. This doesn't have to be the case, and in the next approach, we will discuss the opposite. See the full video here: https://youtu.be/4aKN5mCPF5A?si=kThpx8hOok1i0QWC&t=327.
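A minimal sketch of this committee idea might look as follows. The "models" here are stub functions standing in for calls to different LLM APIs, and majority voting is just one of many possible synthesis strategies:

```python
from collections import Counter

# Stub "experts"; in practice each would call a different LLM API.
def model_a(question: str) -> str: return "42"
def model_b(question: str) -> str: return "42"
def model_c(question: str) -> str: return "41"

def committee_answer(question: str, models) -> str:
    """Collect one answer per model and return the majority response."""
    answers = [m(question) for m in models]
    return Counter(answers).most_common(1)[0][0]

print(committee_answer("What is 6 x 7?", [model_a, model_b, model_c]))  # prints "42"
```

A production ensemble would likely weight votes by model confidence or route questions to the experts best suited to them, but the synthesis step stays the same.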
Another approach to assembling a team of experts is by simulating a professional team. Here, designated roles are assigned to the LLM, and the task is then addressed by each of the designated roles in turn. Each role addresses both the task itself and the work product left by the roles that came before it. This way, there is an iterative approach to building out a thoughtful solution to a complex problem. We saw this fascinating process in our example from Chapter 9, where we leveraged Microsoft's Autogen.
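The iterative, role-based flow can be sketched as follows. The roles and their outputs are stubs, assumed for illustration; a framework such as Autogen wires real LLM calls into this kind of loop:

```python
# Each role receives the task plus the accumulated work of previous roles.
def planner(task, work):
    return work + ["plan: break '%s' into steps" % task]

def coder(task, work):
    return work + ["code: draft an implementation of the plan"]

def reviewer(task, work):
    return work + ["review: check the draft against the plan"]

def run_roles(task, roles):
    """Pass the accumulated work product through each designated role in turn."""
    work = []
    for role in roles:
        work = role(task, work)
    return work

for line in run_roles("sort a list", [planner, coder, reviewer]):
    print(line)
```

The key property is that later roles see everything earlier roles produced, so the solution is refined step by step rather than generated in one shot.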
Prompting LLMs effectively has become a subtle art and science known as prompt engineering. As models grow, manually crafting every query might become infeasible. The future could see automated or semi-automated methods to generate prompts, ensuring consistent and desired outputs. The push would be towards making LLMs more user-friendly, minimizing the need for specialized knowledge to interact with them effectively.
In Chapter 8, we covered some of the key aspects of prompt engineering. We explained how a technical feature, such as a system prompt, can be leveraged with OpenAI's GPT models. What's interesting is that there are non-technical aspects to prompt engineering that are just as valuable in achieving optimal LLM results. When we say non-technical, we mean aspects such as a coherent description of the request within the prompt, just as we would provide to a person we were asking for help.
We expect to see further subtle techniques in prompting, as seen with prompt chains and soft prompting. Prompt chains are prompt iterations where a complex task is broken into small tasks, each reflected in a small prompt. This allows for greater adherence, correctness, and monitoring. Soft prompting is an algorithmic technique that seeks to fine-tune the vectors representing the prompt.
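A prompt chain can be sketched as follows. Here, llm() is a stub standing in for a real model call, and the sub-prompts are illustrative; the point is that each small prompt consumes the previous step's output:

```python
def llm(prompt: str) -> str:
    # Stub: a real implementation would call an LLM API here.
    return "RESPONSE(%s)" % prompt

def prompt_chain(task: str, steps) -> str:
    """Run each sub-prompt in order, feeding the previous output forward."""
    result = task
    for step in steps:
        result = llm("%s\n\nInput: %s" % (step, result))
    return result

steps = [
    "Extract the key entities from the input.",
    "Summarize the relationships between those entities.",
    "Write a one-paragraph report from the summary.",
]
print(prompt_chain("Quarterly sales text...", steps))
```

Because each step is small and inspectable, intermediate outputs can be validated or logged, which is where the adherence, correctness, and monitoring benefits come from.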
One such fascinating example is Large Language Models as Optimizers by C. Yang et al.; see the publication on arXiv: https://arxiv.org/abs/2309.03409. They found that encouraging the LLM to put emphasis on the thoughtfulness it gives to the solution yielded better performance. That may sound surprising if we assume that the LLM has just a single inherent process for solving every particular problem. For example, if we were to ask it to solve an equation, one could assume the LLM would employ one particular mathematical technique, but what about a complex question that needs to be broken down into a series of step-wise tasks, where neither the structure of the series nor the solution method for each task is trivial? Ordering the LLM to focus on optimizing not just the outcome but also the derivation improves the result. This is done by adding a request such as the following:
These were all taken from the publication. The one that stood out the most was this:
"Take a deep breath and work on this problem step-by-step."
Their research suggests that while an LLM clearly doesn't take breaths, it understands this addition to the prompt as an emphasis on the importance of the derivation process.
We take this opportunity to discuss, again, a significant new paradigm in the world of NLP that we expect will continue to spread widely in the coming year: RAGs.
As we have witnessed, generative AI driven by LLMs is proficient at producing detailed and easy-to-understand textual responses based on extensive training over vast corpora of data. However, these responses are limited to the AI's training data. If the LLM's data is outdated or lacks specific details about a topic, it may not produce accurate or relevant answers.
Retrieval-augmented generation, also known as RAG, enhances the LLM's capabilities by integrating targeted, current, and perhaps even dynamic information without altering the LLM itself. This method was introduced in a 2020 paper by P. Lewis et al. titled Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks; see it on arXiv: https://arxiv.org/abs/2005.11401.
In Chapters 8 and 9, we studied RAGs from a practical standpoint, equipping readers with the necessary tools and knowledge for hands-on experimentation and implementation. As we revisit RAGs, our focus shifts towards examining their significance within the broader narrative of NLP and LLM development. This discussion is framed within a qualitative, conceptual context that explores the evolving trends and future directions of algorithmic advancements. Our aim is to contextualize RAGs not just as a technological tool but as a pivotal component in the ongoing evolution of LLMs, highlighting their role in shaping the next generation of AI solutions. This exploration seeks to bridge the technical with the theoretical, offering insights into how RAGs contribute to and are influenced by the dynamic landscape of AI research and application.
For intuition, think about the following example. Let's take some programming language; it could be Python, R, C++, or any other general-purpose language. It comes with its inherited "knowledge," which is the set of built-in libraries and functions. If you build code to perform basic math or form a sorted list, you'll find that the current state of the programming language suits you, as it has built-in code libraries with all the functions you require. However, what about when you are looking to perform some action that is extremely different from the common set of libraries and their functions? For instance, translating a foreign language to English, calculating a Fourier transform, or performing image classification. One could, hypothetically, seek to develop a whole new dedicated programming language whose intrinsic set of built-in libraries includes all the functionality required. Conversely, one might simply build a dedicated library and import it into their programming language's environment. In this way, your code simply retrieves the necessary functions. Clearly, that is the way general-purpose programming languages work, and it is the easier and more scalable solution of the two. That is what RAGs seek to achieve in the context of LLMs. The LLM is analogous to the programming language, and the retrieval of information from an external data source is analogous to importing a dedicated library.
Let's observe Figure 10.2 as we review RAGs a little more.
Figure 10.2 – A flow diagram of a typical RAG
How RAGs work
The following are the pillars for RAG functionality:
This mechanism may seem familiar to you. We implemented this paradigm in Chapters 8 and 9 when we introduced LangChain's capabilities and designed pipelines that retrieve text from an external file.
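To make the flow concrete, here is a minimal, self-contained sketch of the retrieve-then-prompt pattern. Bag-of-words vectors stand in for the learned embeddings a real system would use, and the documents and queries are illustrative:

```python
import math
from collections import Counter

# Toy bag-of-words "embedding"; a real RAG uses a learned embedding model.
def embed(text: str) -> Counter:
    cleaned = "".join(c if c.isalnum() or c.isspace() else " " for c in text.lower())
    return Counter(cleaned.split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Offline step: embed the documents once, ahead of prompt time.
documents = [
    "The refund policy allows returns within 30 days.",
    "Shipping takes five business days on average.",
]
index = [(doc, embed(doc)) for doc in documents]

def retrieve(query: str) -> str:
    """Return the document whose embedding is closest to the query's."""
    q = embed(query)
    return max(index, key=lambda pair: cosine(q, pair[1]))[0]

def build_prompt(query: str) -> str:
    """Prepend the retrieved context to the question before calling the LLM."""
    return "Context: %s\n\nQuestion: %s" % (retrieve(query), query)

print(build_prompt("What is the refund policy?"))
```

The augmented prompt is then sent to the LLM as usual; the model itself is untouched, which is precisely the appeal of the approach.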
Let’s get some more perspective on RAGs by reviewing their strengths and weaknesses.
Let’s go through the advantages of RAGs in the following list:
With RAGs being a new technology built on LLMs, which are themselves a new technology, various challenges arise.
One such challenge is the choice of the structure of the retrieved data, which is significant to the functionality of the RAG. It is common practice to process the raw data ahead of time, in bulk, so that when the LLM is used, the data is already in a format that lends itself to the retrieval process. As such, this offline process has a complexity of O(1) when measured as a function of the number of retrievals or prompts. Vector databases are emerging as the go-to design for this purpose. They are numerical databases that aim to capture a minimal representation of the data in a format that is similar, and sometimes identical, to the format that the LLM employs when it processes a prompt. This format is the embeddings that we have covered throughout the book. It should be added that embeddings are a form of lossy compression. While the embedding space is optimized for a predefined purpose, it isn't perfect, in two senses. First, it optimizes a particular loss function that may suit one purpose more than another, and second, it does so while trading off other aspects, such as storage and run time. We are seeing a trend within the embedding space where the dimensionality (the size of the embedding vector) is growing higher. A higher dimensionality accommodates a broader context per vector, thus opening the door for better retrieval mechanisms that, in turn, accommodate domains that require deep and complex insights, such as the legal space or journalism.
Another downside is the fact that in order to accommodate the added information that the external data source provides, the prompt that is sent to the LLM needs to grow in size. Now, the prompt isn't expected to host the entire database's text. A preliminary mechanism is first applied to narrow down the text that might be relevant, as we have seen in our code example in the healthcare space. Still, a cutoff must be applied to the amount of data that is sent within the prompt, thus trading off the amount of context the LLM has to refer to.
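This cutoff can be sketched as a simple packing step. Token counts are approximated here by whitespace-separated words, whereas a real system would use the model's own tokenizer, and the chunks and scores are illustrative:

```python
def pack_context(chunks, budget: int):
    """chunks: list of (text, relevance_score); greedily keep the most
    relevant chunks that fit within the token budget."""
    packed, used = [], 0
    for text, _score in sorted(chunks, key=lambda c: c[1], reverse=True):
        cost = len(text.split())  # crude token count: whitespace words
        if used + cost > budget:
            continue  # chunk doesn't fit; try the smaller, less relevant ones
        packed.append(text)
        used += cost
    return packed

chunks = [
    ("alpha beta gamma delta", 0.9),       # 4 "tokens"
    ("epsilon zeta", 0.8),                 # 2 "tokens"
    ("eta theta iota kappa lambda", 0.7),  # 5 "tokens"
]
print(pack_context(chunks, budget=7))  # keeps the two most relevant chunks
```

Raising the budget admits more context but inflates prompt cost and latency, which is exactly the trade-off described above.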
The immediate use cases of RAGs involve having an engine that is dedicated to a narrow need. Some examples are as follows:
As we identify RAGs as a key technology that may come to dominate in-house customized development, let's discuss the heavier and more comprehensive approach of customizing the LLM itself.
The customization trend will continue intensifying as a customized LLM presents a complete holistic product that is proprietary to its maker. We're likely to see industry- or task-specific LLMs becoming the norm. From LLMs tailored to legal jargon to those adept at medical diagnoses, the future is specialized. This will involve the various design choices of model pre-training, model fine-tuning, and retrieval-based designs, which leverage dedicated datasets.
While the typical RAG caters to leveraging in-house and non-public data, a customized LLM suits cases where an entire domain is to be learned and mastered. For instance, if we wanted to choose one of these two approaches as a tool that would ideate and synthesize NLP and AI solutions, we would choose an LLM that was trained on the relevant data, e.g., publications, learning material, and patents, and not a RAG that simply makes this data available to a generic LLM. The customized LLM would offer a chain of thought that is inherited from the data it was trained on. The RAG would leverage a generic LLM with its generic chain of thought, where it would have additional data to refer to.
We have now touched on the four pillars for enhancing LLMs’ performance. Going from optimizing the prompt to putting together a dedicated LLM, one must trade off the potential improvement in performance with the cost and complexity of the process. Figure 10.3 portrays this concept:
Figure 10.3 – Spectrum of complexity
English is the new programming language. The outlook for LLMs in the realm of coding is particularly intriguing. Traditionally, coding has been seen as a specialized skill, demanding meticulous attention to detail and extensive training. But with the evolution of LLMs, there’s a growing potential to democratize the world of software development. We are witnessing the realization of a long-term vision where, instead of poring over lines of code, developers can provide high-level instructions to an LLM, which, in turn, generates the required code. It’s like having a fluent translator who can effortlessly turn human intent into machine-readable directives. We have seen an example in Chapter 9 where an LLM took on several professional roles and put a programming project together for the user.
Such a shift won’t just streamline the coding process; it could fundamentally transform who gets to create software. Non-technical individuals could engage more directly in software development, bridging the gap between idea generation and execution. Start-ups, for instance, could swiftly turn their visions into prototypes, speeding up innovation cycles and fostering a more inclusive tech ecosystem. We anticipate this will revolutionize several business disciplines, for example, technical product management. Of course, this doesn’t imply that traditional coding skills will become obsolete. On the contrary, understanding the intricacies of programming languages will always have its value, especially for tasks that demand precision and nuance. However, LLMs can act as invaluable assistants, catching bugs, suggesting optimizations, or even helping with mundane and repetitive tasks. This synergy between human developers and LLMs might lead to a golden age of software development where creativity takes center stage and the technical barriers are lowered. Furthermore, as LLMs get more adept at understanding and generating code, we might also see an increase in the development of novel algorithms, frameworks, and tools. These advancements could be spurred on by the unique perspective that a machine brings to problem-solving, supplemented by the vast amounts of data and patterns it has been trained on.
In summary, the future of LLMs in the coding world holds the promise of collaboration, inclusivity, and innovation. While challenges will undoubtedly arise, the potential benefits for both seasoned developers and newcomers to the field are enormous.
Just as DevOps revolutionized software development, LLM operations (LLMOps) are becoming crucial for the scalable deployment, monitoring, and maintenance of LLMs. As businesses increasingly rely on LLMs, ensuring their smooth operation, continuous learning, and timely updates will become paramount. LLMOps might introduce practices to streamline these processes, ensuring LLMs remain efficient and relevant. We are seeing great efforts made regarding this cause in the form of paid tools and services. Companies are designing solutions that stretch through the spectrum of operations and monitoring. On one end of the spectrum are tools that provide basic monitoring of the LLM’s functioning, and on the other end are tools that provide visuals and statistical insights into the incoming data, outgoing data, and model characteristics.
A new trend in the LLMOps space is creating a feedback loop from the monitoring feed to the model tuning mechanism. This mimics the concept of real-time adaptive models, such as the Kalman filter, which helped guide Apollo 11 to the moon. The monitoring stream recognizes growing deviations, which are then fed back into a training mechanism that tunes the model's parameters. By doing so, not only is the user alerted when the model becomes sub-optimal, but the proper adjustment is also applied to the model.
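A toy version of such a feedback loop might look like the following; the metric values, thresholds, and correction rule are all invented for illustration:

```python
from collections import deque

class DriftMonitor:
    """Sketch of a monitoring-to-tuning feedback loop: track a quality
    metric over a sliding window; when its mean drifts from the baseline
    beyond a threshold, record an alert and apply a damped correction
    (a stand-in for re-tuning a model parameter)."""
    def __init__(self, baseline, window=5, threshold=0.05):
        self.baseline = baseline
        self.threshold = threshold
        self.recent = deque(maxlen=window)
        self.alerts = []
        self.correction = 0.0

    def observe(self, metric):
        self.recent.append(metric)
        deviation = sum(self.recent) / len(self.recent) - self.baseline
        if abs(deviation) > self.threshold:
            self.alerts.append(deviation)
            self.correction -= 0.5 * deviation  # feed the deviation back
        return deviation

monitor = DriftMonitor(baseline=0.90)
for metric in [0.91, 0.89, 0.88, 0.70, 0.68]:  # quality starts degrading
    monitor.observe(metric)
# monitor.alerts now holds the detected drift, and monitor.correction
# holds the adjustment that would be pushed back into the model
```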
To sum up this review, the journey of LLMs, marked by leaps in DL, innovative learning techniques, and customization capabilities, taps into a broader ambition of humanity: to create machines that understand and enhance our world. The evolution of LLMs encapsulates this quest, and as they continue to mature, their purpose, value, and impact will undoubtedly shape the contours of our digital future.
The future of LLM design is poised at the intersection of technological innovation, user-centric design, and ethical considerations. As research progresses and user needs evolve, the LLMs of tomorrow might be radically different, more capable, and more integrated than what we imagine today.
We have discussed the various technical trends around LLMs, which are at the core of their emergence and growth. Now, we touch on the trends that are further away from the core and are reflective of the impact that these models have had and are expected to make.
Cultural trends in NLP and LLMs
In this section, we will discuss some of the trends and impact points that LLMs and AI have had on business and society. We will touch on some of the industries that we identify as likely to thrive the most, thanks to the value that LLMs and AI bring to the table. We will talk about the internal changes that are taking place in corporations as they seek to gain an advantage and stay ahead of the curve. Last, we will touch on some of the cultural aspects that revolve around LLMs and AI.
NLP and LLMs are proving themselves to be transformative in the business domain. From improving efficiencies to enabling new business models, NLP’s capabilities have been harnessed to automate mundane tasks, derive insights from data, and provide advanced customer support.
Initially, NLP was mostly restricted to academia and specialized sectors. However, with the rise of digitalization, the explosion of data, and advancements in open-source ML, businesses began to recognize its potential. The affordability of computing power and accessibility of vast datasets made the implementation of LLMs feasible for enterprises, allowing for more sophisticated NLP applications. We observed that this transition of NLP into the business world took place from 2018–2019. First, the combination of NLP and traditional ML models for limited tasks, such as text classification, began to infiltrate business operations and analytics. In 2019, Hugging Face released a free version of Google's BERT, the groundbreaking LM that we discussed in previous chapters (see more detail on the model page: https://huggingface.co/bert-base-uncased). BERT employed transfer learning in a way that allowed for great classification power with a relatively minimal amount of labeled data, and it quickly became the go-to model for many text-driven business applications.
Some industries have inherent characteristics that make them more likely to adopt NLP-driven automation and thrive on it. When evaluating the potential impact that NLP could have on an industry, or even on a particular business, consider these traits:
Let’s explore specific business sectors to see how AI and LLMs are making a difference in each of them.
Healthcare is an industry that relies heavily on free text. Every business in the healthcare space that interacts with patient treatment, whether it is a clinic, a hospital, or even an insurer, has a data stream that involves free text. It could be a transcription of medical notes, patient query responses, drug interactions, or other sources of information. The vast majority of these are digitized and are, thus, machine-readable, setting the stage for downstream processing. Those processes could involve identifying diagnoses from radiology reports, classifying patient details for treatment or clinical trials based on physician notes, alerting on potential risk based on patient reporting, and many other use cases.
Another major use case that is emerging in healthcare is patients seeking medical advice from generative AI tools such as ChatGPT. As LLMs have access to a sea of data, patients have found that an LLM can suggest an answer to a medical question. While the potential is huge, the risk is great as well.
In the next few years, we anticipate major improvements regarding LLMs’ ability to support healthcare needs. With patient care in particular, we will see an improvement in augmenting core medical competencies. Different tiers of medical advice, diagnoses, and prognoses will be assigned different balances between professional advice and AI advice. For instance, throughout history, we have seen patients self-diagnose mild conditions, such as a rash or a pain, or take advice from other non-professionals. Moreover, nowadays, we see patients seeking advice in online articles and posts. We expect that for these same conditions, which are perceived as low-risk, patients will adopt LLMs for advice. As for official policies, we will see clinical systems dictate guidelines as to which cases would be handled by AI and to what extent.
Finance is a broad industry that is heavily dependent on text information. From financial filings to earnings calls, news feeds to regulatory updates, transaction details to credit reports, and so on. The financial sector is seen as a precursor to how other industries might evolve with the rise of AI. Its heavy reliance on data processing makes it a natural fit for AI and serves as a case study for what might happen elsewhere.
We see NLP and LLMs used in all corners of the financial spectrum. A new trend we are noticing is building dedicated chatbots for particular topics and even individual companies as they seek to present their proprietary service to their customers in the form of an interactive chatbot.
Our overall expectation for the future of finance is a collaborative environment where AI-driven models seamlessly work in tandem with industry specialists. The best historical analogy we have for this vision is the synergy that Microsoft created between Excel and financial analysts. Envision a setting where a traditional AI model maps out financial projections and its generative counterpart dives deep into the data, not just highlighting variances but also suggesting strategic choices based on diverse forecast models.
E-commerce is an industry that constantly sits at the intersection of customers and technology. One use case in the e-commerce space is the personalized shopping experience. As NLP techniques become more sophisticated, e-commerce platforms can predict emerging trends, offer real-time personalized discounts based on user sentiment, and enhance cross-selling and upselling strategies. From the aspect of product search, LLMs understand natural language queries, enabling users to find products more effectively.
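A minimal illustration of query-to-product matching follows, using simple keyword overlap as a crude stand-in for the semantic matching an LLM-backed search would perform; the catalog and field names are invented:

```python
import re

PRODUCTS = [
    {"name": "cotton summer dress", "tags": ["clothing", "women", "summer"]},
    {"name": "leather office chair", "tags": ["furniture", "office"]},
    {"name": "running shoes", "tags": ["footwear", "sports"]},
]

def search(query, products=PRODUCTS):
    # Score each product by how many query words appear in its name or tags
    terms = set(re.findall(r"[a-z]+", query.lower()))
    def score(product):
        words = set(product["name"].split()) | set(product["tags"])
        return len(terms & words)
    matches = [(score(p), p["name"]) for p in products]
    return [name for s, name in sorted(matches, reverse=True) if s > 0]

print(search("a light dress for summer"))  # → ['cotton summer dress']
```

An LLM-backed search would go further, handling paraphrases and intent ("something for a beach wedding") that no keyword overlap can capture, but the ranking-by-relevance shape is the same.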
The future landscape of e-commerce is set to undergo a transformational shift. The virtual realm is expanding with the advent of AI-enabled metaverse shopping, combining visual AI, augmented reality, and virtual reality technologies. This will present consumers with a thrilling opportunity to try products virtually, from clothing to furniture, providing a shopping experience that’s as close to reality as possible. Moreover, the complexities of supply chain management will continue to be addressed with AI-driven predictive analytics, optimizing inventory processes. AI promises to be a cornerstone in shaping a dynamic and efficient future for the e-commerce industry.
The second-to-last industry we want to mention is education. Here, too, we are seeing a trend around personalization. NLP allows for adaptive learning platforms that cater to individual student needs, providing resources and quizzes based on their learning pace and style. NLP-driven platforms can analyze student inputs, essays, and feedback to offer custom-tailored learning paths. Another trend is around language learning. LLMs offer real-time translations, corrections, and even cultural context, making language learning more immersive.
As the rapid development of generative AI tools increasingly permeates the education sector, the traditional paradigms of teaching and learning are poised for substantial change. We anticipate a future where AI seamlessly integrates into classrooms, amplifying the efficacy of instruction and personalizing learning experiences in unprecedented ways. Simultaneously, we will see advancements in personalization where students can enjoy a learning experience that would be best described as a computerized private tutor. It would adapt the material being taught and the manner in which it is communicated to suit the student’s pace and perception. For children born in current times, we expect the educational experience to be innovative, limitless, and not at all boring.
Last but not least is the entertainment and content consumption industry. The reciprocal relationship between AI and the media industry has become evident in recent years. With LLMs and AI continually evolving, media platforms have harnessed them to optimize content creation, distribution, and consumption.
The music landscape is being reshaped. DL models generate distinct compositions after learning from existing musical patterns. Platforms such as Spotify personalize playlists through ML-driven recommendations, analyzing listening history and preferences. The audio mastering process, traditionally demanding expertise, now incorporates AI solutions such as LANDR, democratizing and accelerating music production.
Filmmakers harness LLMs for scriptwriting, enabling the creation of unique narratives while also assessing potential uncertainties in screenplays. AI’s predictive prowess is showcased by Warner Bros., 20th Century Fox, and Sony Pictures, which utilize the platforms Cinelytic, Merlin, and ScriptBook, respectively.
AI enriches gameplay by simulating realistic non-player character behaviors and dynamically generating content. It offers personalized game recommendations, tailoring the experience to player preferences. Adaptive difficulty systems analyze real-time player behavior, adjusting challenges to ensure a balanced gaming experience.
In the world of book publishing, the manuscript submission process is streamlined by AI, automating screenings and predicting market potential. AI-driven tools bolster the editing phase by ensuring clarity, coherence, and adherence to style guidelines. LLMs aid authors in crafting compelling narratives by providing insights into character and plot structures. Personalization algorithms in platforms tailor content recommendations to users’ tastes, enhancing engagement. Platforms such as Google AdSense utilize AI to target online advertisements precisely, optimizing campaign outreach. AI also plays a regulatory role, filtering content based on user demographics and ensuring compliance with broadcasting guidelines. Finally, streaming platforms employ AI for content categorization, offering users a seamless content discovery experience.
These highly innovative applications of AI and LLMs in the entertainment industry are going to grow and shape the creations they touch. Creation processes will be shorter and faster. The question that will arise more and more frequently is whether having the creation of art orchestrated by a computer model takes away from its charm.
Next, we’ll take a step back from business sectors and discuss a particular use case that is ubiquitous across any customer-facing business.
One of the most visible impacts of NLP in businesses is in customer interactions. LLMs enable responsive chatbots, assist in sentiment analysis, and provide real-time solutions, enhancing user experience. Early chatbots were rule-based and could handle limited queries. With LLMs, chatbots can understand context, handle complex queries, and even engage in casual conversations. This progression has led to increased customer satisfaction, reduced wait times, and substantial cost savings for businesses.
In the next few years, we can expect to continue to see AI and LLMs used in a wide range of customer service applications, including chatbots, recommendation systems, proactive customer engagement systems, and customer service analytics systems. These AI- and LLM-powered applications will be able to deliver several benefits to both businesses and customers. We will see chatbots become comprehensive enough to handle the cases that currently require a human agent to step in. Recommendation systems will further personalize and capture the individual customer’s interests, coming to resemble the personal human assistants that are currently the privilege of a tiny portion of the population. On a macro level, customer service analytics systems will be used to analyze customer data and identify trends and patterns that can be used to improve customer service operations.
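As a toy illustration of the macro-level analytics described above, the following aggregates support tickets by topic and period and flags topics with rising volume; the data and topic labels are fabricated:

```python
from collections import Counter

TICKETS = [
    ("week1", "billing"), ("week1", "login"), ("week1", "billing"),
    ("week2", "billing"), ("week2", "shipping"), ("week2", "shipping"),
    ("week2", "shipping"), ("week2", "login"),
]

def rising_topics(tickets):
    """Toy customer-service analytics: compare topic counts across the
    first and last periods and surface topics whose volume is growing."""
    by_week = {}
    for week, topic in tickets:
        by_week.setdefault(week, Counter())[topic] += 1
    weeks = sorted(by_week)
    prev, curr = by_week[weeks[0]], by_week[weeks[-1]]
    return sorted(t for t in curr if curr[t] > prev.get(t, 0))

print(rising_topics(TICKETS))  # → ['shipping']
```

A real system would feed free-text tickets through an NLP classifier to assign the topics in the first place; the aggregation step afterward looks much like this.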
Overall, the prospects for AI and LLMs in customer service are exceptionally promising. These technologies stand poised to transform business-customer interactions, offering more tailored, anticipatory, and immersive service experiences.
Having explored the transformative role of AI and LLMs in customer service, let’s now pivot to another critical dimension: organizational structures. As companies gear up for the AI era, it’s imperative to understand how they’re reshaping their internal frameworks to integrate these technological advances.
As AI, particularly the capabilities of LLMs, continues its meteoric rise, businesses worldwide are feeling the ripple effects. To remain competitive and harness the full potential of these technological marvels, many organizations are undergoing transformative shifts in their internal structures and operations. These changes range from reimagining workflow dynamics to the introduction of pivotal roles such as the Chief AI Officer. We will now explore how AI’s profound influence is reshaping the very fabric of contemporary business paradigms.
Beyond external customer interactions, LLMs have deeply impacted how businesses operate internally. From automating emails to handling HR queries, NLP has streamlined operations. Initially, businesses used simple automation tools to handle repetitive tasks. With LLMs, the spectrum of automatable tasks has widened. Whether it’s drafting reports, analyzing employee feedback, or predicting market trends, NLP plays a pivotal role.
A particular shift we are seeing in the organizational landscape regards the tech stack structure. Traditionally, a company’s tech stack can be visualized as a layer cake, with each layer having a distinct role:
With the evolution of AI, new layers and components are being introduced, reshaping the tech stack:
Let’s go over these new additions.
These changes are the fruit of the rapid innovations we see driven by AI. For instance, multimodal capabilities are emerging and enabling us to process signals in the form of text, images, video, audio and music, and code. Moreover, AI products such as chatbots, recommendation systems, and predictive analytics tools are becoming essential for businesses.
The revised decision-making layer is now driven by AI applications. Unlike traditional software, AI applications are built with the capability to “think” and “learn.” They process multimedia content, such as images, videos, and music, in ways that were once thought impossible. For instance, through image recognition, one can identify and categorize objects in a photo, while video analytics can analyze patterns and anomalies in real-time footage. Even more fascinating is the ability of some of these apps to generate new music compositions or artworks, bridging the gap between technology and art.
The next new layer is the AI layer. Its key component is AI products. When we talk about AI products, we refer to a vast array of tools and platforms built on the foundation of AI. These range from chatbots that provide real-time customer support to recommendation systems that personalize user experiences on e-commerce platforms. Predictive analytics, another pillar of AI products, allows businesses to forecast trends and make informed decisions. Collectively, these products represent a paradigm shift from reactive to proactive business strategies, ensuring that businesses are always a step ahead.
Observability and monitoring supplement the above additions by mitigating risk and applying quality control. As powerful as AI is, it also brings forth ethical and operational challenges. AI guardrails can address these concerns by ensuring that AI operates within defined ethical boundaries, promoting fairness, transparency, and privacy. For instance, an AI guardrail might prevent an algorithm from making decisions based on biased data, or it could offer explanations for the decisions an AI system makes. In an age where trust in technology is paramount, these guardrails are crucial for ensuring that AI is not just smart but is also responsible. At the same time as enforcing guardrails, the traditional production monitoring of data and model outputs is applied to assure consistency and quality.
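A minimal sketch of such a guardrail, paired with a monitoring hook, might look like this; the blocked patterns, log format, and policy message are all invented for illustration:

```python
import re

BLOCKED_PATTERNS = [
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),   # US SSN-like identifiers
    re.compile(r"(?i)guaranteed returns"),  # non-compliant financial claims
]

audit_log = []

def guarded_output(model_response):
    """Screen model output against blocked patterns before it reaches the
    user, and record every decision so the monitoring layer can track the
    rate of blocked responses over time."""
    for pattern in BLOCKED_PATTERNS:
        if pattern.search(model_response):
            audit_log.append({"verdict": "blocked", "rule": pattern.pattern})
            return "[response withheld by policy]"
    audit_log.append({"verdict": "passed", "rule": None})
    return model_response

print(guarded_output("Your SSN 123-45-6789 is on file."))  # withheld
print(guarded_output("The weather looks clear today."))    # passes through
```

Real guardrail products layer classifiers, PII detectors, and explanation mechanisms on top of this pattern, but the core loop of screening plus auditable logging is the same.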
To conclude our discussion of the shift in tech stacks, we anticipate AI to be more than a trend that technology enables; rather, it will be an enabler of new trends in technology. For that reason, we expect the data and tech paradigm to change and put AI in the center. We believe companies that adapt and evolve their stacks to harness these new capabilities will be better positioned to succeed in this new digital age.
As we survey the evolving reshaping of modern organizations, let's consider a particular addition to the corporate world: the chief AI officer. This position underscores the paramount importance AI holds in the modern corporate arena.
As AI is set to impact business, it is expected that it will also reshape businesses. In the previous section, we detailed our anticipation of the common organizational tech stack that will transform and give room to components that are purely AI-oriented. Following a similar path, the leadership structure is also expected to change and make room for a new role: the chief AI officer (CAIO). This section will delve deep into the CAIO’s role, responsibilities, and the unique value they bring to an organization.
AI is no longer a distant technological marvel; it’s now intertwined in our everyday lives. With the creation of generative tools such as OpenAI’s ChatGPT and Google’s Bard, AI’s capabilities are now accessible to businesses of all natures. AI’s transformative potential ranges from creating innovative services and improving operational efficiency to revolutionizing entire industries.
Given the impactful nature of AI, incorporating it into the core business strategy is imperative. The need for a CAIO arises from the importance of embedding AI in strategic decisions, ensuring that companies capitalize on the opportunities it presents.
Central to the CAIO’s responsibilities is guiding the organization’s AI strategy to align with its overarching business objectives. This encompasses the following:
With a balance of technical acumen and soft skills being pivotal, the CAIO should be adept with AI tools and infrastructure and also excel in communication, teamwork, problem-solving, and time management.
They must be well-versed in the business implications of AI, understanding its present landscape and anticipating future developments. It's essential for them to be attuned to the ramifications that specific AI technologies might have on their industry.
In an age where AI's ethical considerations are paramount, the CAIO must be an ethical pillar, navigating challenges related to bias, privacy, and societal impact. There is an expectation that a direct and fluid channel of communication will be formed between the company's compliance and legal teams so as to help identify and anticipate sensitive territories that the CAIO may step into.
In conclusion, as businesses increasingly integrate AI into their operational fabric, the CAIO’s role emerges as indispensable; they serve as the torchbearers, illuminating the path for organizations to harness AI’s full potential ethically and effectively. As AI’s significance in the business realm augments, the CAIO stands poised to be a cornerstone of the modern C-suite.
While AI and LLMs are undoubtedly revolutionizing the business landscape, their reach extends beyond the corporate realm. As we transition into our next section, we’ll explore the profound social and behavioral implications these technologies bring to the fore, impacting the very fabric of our society.
The proliferation of AI, particularly advanced models such as LLMs, has had a profound impact on social behavior. This influence ranges from everyday tasks to broader communication trends. As AI integrates into the fabric of daily life, it shapes behaviors, introduces new norms, and occasionally raises concerns. Here, we dive into these behavioral shifts.
With the increase in AI-driven virtual assistants such as Siri, Alexa, and Google Assistant, people are increasingly relying on these tools for daily tasks. Whether it’s setting up appointments, checking the weather, or controlling smart home devices, AI assistants are becoming the go-to for many, changing the way we interact with technology and sometimes even leading us to anthropomorphize these tools.
In the future, we will see AI personal assistants become a completely immersive and inseparable part of our lives. We can draw an analogy to the narrow and limited role that the digital calendar takes in our lives. By allowing us to plan and schedule events efficiently, keeping a calendar ensures we meet commitments and maintain a balance between personal and professional engagements. Furthermore, automated reminders and synchronization across devices alleviate the pressure to remember every appointment, letting us focus on more pressing matters with peace of mind. A personal assistant, whether AI-driven or human, takes things to the next level: it syncs with other individuals, prioritizes, advises, gathers information, and performs other common day-to-day tasks. Until recently, only human assistants could fulfill this function with high confidence. We will soon see this done by automated models at little cost and with little oversight. If you wear prescription glasses, you know exactly what our relationship with our personal AI assistant will be like and, moreover, what it would be like to lose access to it.
LLMs have refined the way we communicate, especially when it comes to written content. People use them for grammar checks, content suggestions, or even generating entire texts. This can lead to more polished communication but also brings up questions about authenticity.
Real-time translation tools powered by AI are revolutionizing the way we communicate across cultures. Platforms such as Google Translate are making it feasible for individuals to interact seamlessly, fostering global connections. However, the increased reliance on these tools might diminish the incentive for some to learn new languages.
In the near future, the boundaries of communication are poised to expand even further, driven by the convergence of advanced LLMs and AI innovations. We will soon see the realization of the vision where two individuals have a call, each speaking a different native language, and can engage in a seamless conversation, with AI invisibly and instantly translating their spoken words. This would mean that, as one person speaks in Mandarin, their counterpart might hear the words in Spanish in real time, with a minimally noticeable delay. Such advancements could effectively eradicate language barriers, allowing for truly global interpersonal connectivity.
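The seamless-call scenario described above rests on a simple chained architecture: speech recognition, machine translation, and speech synthesis running in a low-latency loop. The sketch below illustrates the shape of that chain; every stage is a hypothetical stub (the function names and the tiny lexicon are ours for illustration, not any real API), and a real system would plug in actual ASR, MT, and TTS models.

```python
def recognize_speech(audio_chunk: str, source_lang: str) -> str:
    # Stub: stands in for an automatic speech recognition (ASR) model.
    # Here we pretend the audio has already been transcribed to text.
    return audio_chunk

def translate(text: str, source_lang: str, target_lang: str) -> str:
    # Stub lexicon standing in for a machine translation model.
    lexicon = {("zh", "es"): {"你好": "hola"}}
    return lexicon.get((source_lang, target_lang), {}).get(text, text)

def synthesize(text: str, target_lang: str) -> str:
    # Stub: a real text-to-speech stage would emit audio, not a string.
    return f"<audio:{target_lang}:{text}>"

def live_translate(audio_chunk: str, source_lang: str, target_lang: str) -> str:
    # The low-latency loop: recognize -> translate -> synthesize.
    text = recognize_speech(audio_chunk, source_lang)
    translated = translate(text, source_lang, target_lang)
    return synthesize(translated, target_lang)

print(live_translate("你好", "zh", "es"))  # -> <audio:es:hola>
```

In practice, the "minimally noticeable delay" mentioned above is the hard part: each stage must stream partial results rather than wait for a full utterance, but the chained structure stays the same.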
Furthermore, the realm of communication is not just limited to the spoken word. Cutting-edge research is delving into the possibility of converting neural signals directly into speech. Neural sensors will detect and interpret brain activity, allowing individuals to “speak” without ever moving their lips. This could be a groundbreaking advancement, especially for those with speech impediments or communication disorders, offering them a voice in a way they’ve never experienced before.
Beyond these capabilities, the tactile dimension of communication might also see innovation. We anticipate wearable devices that allow people to “feel” messages, translating words or emotions into specific tactile sensations. This would open up new channels of understanding, especially for the visually or hearing impaired.
AR combined with AI will redefine our notion of presence. While Meta's Metaverse is struggling to solidify, the notion of interacting via virtual presence will emerge and find demand. You will be able to project your avatar to a distant location, communicating with others as if you were physically there. The nuances of facial expressions, body language, and gestures will be captured and relayed, adding depth to remote conversations.
As people grow accustomed to AI recommendations, from shopping to reading, there’s a risk of over-delegating decisions. This can lead to reduced critical thinking, making individuals more susceptible to algorithmic biases or manipulations.
As we advance further into an AI-driven era, there’s an increasing likelihood that individuals will place undue trust in automated systems, potentially leading to an erosion of personal responsibility and agency. There’s a growing concern that, as more decisions are automated, society might witness a decline in individuals’ ability to make informed judgments without algorithmic input. Moreover, as industries increasingly rely on AI for critical decisions, the transparency and understanding of these algorithms will become paramount to prevent unintentional systemic biases. The potential for AI to perpetuate or even amplify existing societal biases—either through data or design—raises profound ethical implications. As a response, we anticipate a surge in demand for AI ethics courses, transparent algorithmic frameworks, and regulatory oversight to ensure AI systems align with human values and societal norms.
To sum up our review of these various social trends, AI and LLMs are reshaping the social landscape in multifaceted ways. While they introduce conveniences and novel experiences, they also present challenges that society must navigate. Balancing the benefits with the potential pitfalls will be crucial as AI’s role in daily life continues to evolve.
We now shift the focus to two particular aspects of AI that are becoming of interest to perhaps every person and entity seeking to employ AI: ethics and risks.
Throughout the book, we have discussed a variety of aspects with regard to AI in general and LLMs in particular. We touched lightly on the different emerging concerns, and in this section, we will focus on the two biggest discussion topics: ethics and risks.
The integration of AI, particularly LLMs, into our lives brings unparalleled convenience and potential. Yet, with these advances comes a set of evolving ethical concerns and risks that span from individual to societal levels. As these technologies mature, understanding and navigating these areas becomes crucial.
Ethics in AI refers to the moral principles guiding AI design, deployment, and use. It revolves around ensuring fairness, transparency, privacy, and accountability in AI systems. Early AI applications, being rudimentary, posed fewer ethical dilemmas. As AI’s complexity grows, so do the consequences of its decisions, pushing ethics to the forefront. The emergence of LLMs, with their ability to generate human-like text, further amplified these concerns.
The key ethical concerns are as follows:
The key risks are as follows:
These concerns are growing quickly as AI is rapidly advancing. While rapid advancements signify progress and new possibilities, they also introduce challenges for policymakers and ethicists alike. As AI systems become more complex and capable, they often outpace the development of ethical guidelines and regulatory measures. This means that as we harness the latest AI breakthroughs, we may be venturing into uncharted territories without a moral compass or safety net. The agility of AI evolution also poses challenges for businesses and governments. They must constantly adapt to ensure that their practices, regulations, and standards keep up with the latest developments.
Another lens to view these concerns through is the scales of society. On one end is the individual level, where concerns revolve around privacy, data misuse, and personal biases. Individuals find themselves struggling to decipher between AI-generated content and human-generated content. A growing problem we have been witnessing is the spread of misinformation, whether intentional or accidental. This phenomenon is threatening to shake the confidence individuals have in elected officials, legal procedures, and other pillars of society.
On the company level, organizations face challenges in ensuring their AI systems are fair, transparent, and compliant with regulations. They also risk reputational damage from biased or questionable AI outputs.
On a macro scale, societies must address the broader implications of AI, from potential job losses due to automation to the societal divisions that might arise from AI’s discriminatory decisions.
As we stand on the brink of an era where AI's influence permeates nearly every facet of our lives, several key trends shape our collective future. First and foremost, the call for ethical guidelines and frameworks in AI development and deployment has never been louder. In recognition of the chief importance of human welfare in this digital age, significant momentum is building around creating AI systems that prioritize and protect human interests. This goes beyond mere compliance or economic considerations; it's about ensuring that the AI systems of tomorrow resonate with our shared human values and contribute to the greater good.
Parallel to the emphasis on ethics, governments and global entities are gearing up for a more hands-on approach. The era of laissez-faire, hands-off attitudes toward AI is fading. Instead, there's an anticipation of robust regulations that not only keep pace with AI advancements but also ensure its responsible and equitable use. Such regulations will likely cover a spectrum of concerns, from data privacy and security to transparency and fairness, thus ensuring that corporations and individuals alike adhere to a set of globally recognized best practices.
In 2023, Sam Altman, OpenAI’s CEO, appeared before the US Congress to share his perspective on the need to regulate the expanding AI landscape. He emphasized the importance of caution, stating that such influential shifts in human history necessitate appropriate safeguards to ensure their responsible and beneficial implementation. Central to Altman’s argument was his belief that the power of AI models would soon exceed our initial expectations, making them both invaluable tools and potential sources of unprecedented challenges. He passionately advocated for proactive regulatory intervention by governments, asserting that such measures would be crucial to address and mitigate the associated risks of these increasingly sophisticated models.
Gary Marcus, Professor Emeritus at New York University, introduced another perspective, suggesting a more robust oversight mechanism. He proposed the establishment of a new federal agency dedicated to reviewing AI programs. This agency’s role would be to scrutinize these programs before they are made publicly available, ensuring their safety, ethical considerations, and effectiveness. Drawing attention to the rapid evolution of AI, Marcus cautioned about unforeseen advancements, metaphorically stating, “There are more genies to come from more bottles.”
We expect to witness major actions in the form of guardrails; whether governance, municipal or organizational, will dictate the bounds to be enforced and maintained remains to be seen. This will address sensitive domains such as using LLMs for healthcare-related matters, financial decisions, usage by minors, and other matters that require a high sense of responsibility. In particular, we expect there to be clarity regarding what data is allowed to be used to train a model, and in what circumstances.
However, regulations and ethical frameworks, while being vital, are only part of the equation. The end-users—the general public—play a pivotal role in shaping AI’s trajectory. As AI technologies become an integral part of daily life, from smart homes to personalized healthcare, there’s a pressing need for public discourse around its ethical considerations and associated risks. This dialogue will foster a more informed and empowered user base capable of making discerning choices about the AI tools they engage with. Education campaigns, workshops, and public debates will likely surge, creating an environment where every individual is not just a passive consumer but an informed stakeholder.
Lastly, the technological front is set to witness a renaissance of sorts. Gone are the days when the sole focus was on creating the most powerful or efficient AI model. Researchers and developers are now increasingly dedicating their efforts toward creating AI systems that are intrinsically more transparent, fair, and resilient against potential threats. The vision is clear: AI models that not only excel in their tasks but do so in a manner that’s comprehensible, equitable, and impervious to malicious attacks.
In essence, the future of AI is not just about technological marvels; it’s about blending innovation with responsibility, power with transparency, and progress with ethics. As we march into this future, the confluence of these trends promises a world where AI enriches lives, upholds values, and serves the collective betterment of society.
In summary, the relationship between AI, ethics, and risk is multifaceted. While AI, especially LLMs, holds vast potential, it’s imperative to recognize and address the accompanying ethical dilemmas and risks. Only through a balanced approach can we harness AI’s benefits while safeguarding individual and societal interests.
In this chapter, we embarked on a comprehensive journey through the key trends shaping the world of AI, with a particular emphasis on LLMs. At the very heart of these models lies computational power, which acts as the driving engine, enabling breakthroughs and amplifying their potential. With advancements in computational capabilities, we’re not only progressing faster but also unlocking new efficiencies that redefine the realm of possibilities.
Complementing this computational prowess are vast datasets, which leave an indelible mark on NLP and LLMs. We have covered their significance in this chapter and learned that they serve pivotal roles. As we look ahead, the future of data availability in NLP promises to be a dynamic landscape, constantly evolving in response to these challenges.
LLMs themselves have undergone significant evolution, with each iteration aimed at achieving greater scale and capability. We reviewed the impact these models possess and learned that they have undeniably transformed various landscapes, from business to social interactions, paving the way for innovations yet to come.
The cultural footprint of NLP and LLMs is evident in the business world, reshaping customer interactions, redefining internal business structures, and even leading to the emergence of specialized roles such as the CAIO. These advancements, while impressive, also herald a new era of behavioral shifts. From day-to-day tasks to high-level business decisions, AI’s influence on society’s fabric is profound.
Yet, intertwined with these advancements are growing concerns about the ethical implementation and associated risks of AI. The rapid pace of AI’s progression, the opacity of its decision-making processes, and the potential for data misuse underscore the urgent need for ethical guidelines, robust regulations, and increased public awareness. In closing, as AI continues its relentless march forward, it is imperative to approach it with both enthusiasm for its potential and caution for its challenges, ensuring a future where technology serves humanity in the most responsible and beneficial ways.
As the journey of this book unfolds, exploring the vast expanse of natural language processing (NLP) and large language models (LLMs), we arrive at a pivotal juncture in Chapter 11. This chapter is not just a culmination of the themes and discussions that preceded it but also a bridge to the untapped potential and imminent challenges that lie ahead in the realm of NLP and LLMs. Our endeavor through the chapters has been to chart the evolution of NLP from its foundational concepts to the architectural marvels of LLMs, dissecting the intricacies of machine learning (ML) strategies, data preprocessing, model training, and the practical applications transforming industries and societal interactions.
The motivation for this chapter stems from an acute recognition of the pace at which NLP and LLM technologies are evolving and the multifaceted impact they wield on the fabric of our digital society. As we explore the complexities of these advanced models and the trends they spur, it is essential to seek guidance from those navigating these waters at the forefront of innovation, research, and ethical contemplation. The dialogue with experts across diverse domains—legal, research, and executive—serves as a beacon for understanding how LLMs intersect with various facets of professional practice and what future trajectories might look like.
The topics discussed herein are reflective of the broader themes of this book yet delve deeper into specific challenges and opportunities that LLMs present. From mitigating biases in datasets to reconciling open research with privacy, and from organizational restructuring in the wake of artificial intelligence (AI) to the evolving landscape of learning paradigms within LLMs, each discussion is a mosaic of insights that paints a comprehensive picture of the current state and the road ahead.
In this chapter, we will cover the following:
Let’s go through each of the experts’ introductions first.
Nitzan Mekel-Bobrov is the Chief AI Officer (CAIO) at eBay, where he runs the company-wide strategy for AI and technology innovation. An R&D scientist by training, Nitzan has spent his career developing machine intelligence systems directly integrated into mission-critical products. Having led enterprise AI organizations across multiple industries, including healthcare, financial services, and e-commerce, Nitzan is a thought leader in delivering transformational impact through real-time AI at scale, changing companies' business models and core value propositions to their customers. Nitzan received his PhD from the University of Chicago and currently resides in New York City as the GM of eBay NYC.
David Sontag is a Professor of Electrical Engineering and Computer Science at MIT, part of both the Institute for Medical Engineering & Science and the Computer Science & Artificial Intelligence Laboratory. His research focuses on advancing ML and AI and using these to transform healthcare. Previously, he was an Assistant Professor of Computer Science and Data Science at New York University, part of the Computer Intelligence, Learning, Vision, and Robotics (CILVR) lab. He is also Co-Founder and CEO of Layer Health.
John D. Halamka, M.D., M.S., President of the Mayo Clinic Platform, leads a transformative digital health initiative impacting 45 million people in 2023. With over 40 years in healthcare information strategy and emergency medicine, his work spans serving at Beth Israel Deaconess Medical Center (BIDMC), advising administrations from George W. Bush to Barack Obama, and teaching as a Harvard Medical School professor. A Stanford, UCSF, and UC Berkeley alumnus, Halamka is also a practicing Emergency Medicine Professor at Mayo Clinic College of Medicine and Science. An author of 15 books and hundreds of articles, he was elected to the National Academy of Medicine in 2020.
Xavier Amatriain was most recently VP of AI Product Strategy at LinkedIn, where he led company-wide generative AI efforts all the way from platform and infrastructure to product features. He is also a board member of Curai Health, a healthcare/AI start-up that he co-founded and was CTO of until 2022. Prior to this, he led engineering at Quora and was Research/Engineering Director at Netflix, where he started and led the Algorithms team that built the famous Netflix recommendations. Xavier started his career as a researcher in both academia and industry. With over 100 research publications (and 6,000 citations), he is best known for his work on AI and ML in general, and recommender systems in particular.
Dr. Melanie Garson, Cyber Policy & Tech Geopolitics Lead at the Tony Blair Institute, delves into cyber policy, the geopolitics of AI, compute, and the internet, the rise of tech companies as geopolitical actors, data governance, and the intersection of disruptive tech, foreign policy, defense, and diplomacy. At University College London, she's an Associate Professor teaching on the impact of emerging technologies on conflict, negotiation, and tech diplomacy. A regular speaker at international forums and in media including the BBC and CNN, Melanie's background includes being an accredited mediator and solicitor at Freshfields Bruckhaus Deringer. She holds a PhD from University College London and a master's from the Fletcher School of Law and Diplomacy.
We had the opportunity to pick the brains of each of these experienced individuals and learn how their careers intersect with and leverage AI and LLMs. We tailored questions to each of them so as to allow them to teach us through their insights and perspectives. We found these discussions rewarding, as they shed light on topics that are common to many and would be valuable for anyone reading this book. Let's dive right in.
Nitzan brings the CAIO's perspective as he and eBay encounter the vast potential that AI and LLMs have to offer. He shares the many diversified aspects that the CAIO has to address and decide on.
Let’s go through the questions and answers with Nitzan Mekel-Bobrov.
In thinking about the potential next breakthrough in combining different learning paradigms within LLMs, I can articulate these ideas:
These ideas point toward a future where AI models are not only more efficient and scalable but also significantly more intelligent and capable of nuanced understanding and reasoning. The emphasis on multimodality, scalability, real-time optimization, and enhanced reasoning capabilities highlights the direction of AI development toward more holistic, human-like intelligence and utility.
The use of multiple LLMs can go beyond the notion of validation and reducing hallucinations. A broader idea, sometimes referred to as K-LLMs, can utilize multiple LLMs to answer a question or create a complex solution. One such scheme, as discussed previously, could be where each of the models checks each other’s answers to validate responses. A possible other approach is where they are assigned roles where each has its particular specialty (for example, product manager, designer, frontend engineer, backend engineer, and QA engineer) and they iterate over the solution, forming a team of experts. This can also allow for smaller and specialized LLMs, which are thus cheaper to train, quicker to process, and smaller in computation requirements.
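The answer-checking variant of the K-LLM idea can be reduced to a majority vote. Here is a minimal sketch of that scheme in Python; the model calls are stubbed with canned answers (`call_model` and the model names are hypothetical, purely for illustration), and a real setup would query actual models and could extend the same voting idea to the role-based team of specialists described above.

```python
from collections import Counter

def call_model(name: str, question: str) -> str:
    # Stub standing in for a real LLM API call; returns canned answers
    # so the voting logic can be demonstrated deterministically.
    canned = {
        "model_a": {"capital of France?": "Paris"},
        "model_b": {"capital of France?": "Paris"},
        "model_c": {"capital of France?": "Lyon"},  # simulated hallucination
    }
    return canned[name].get(question, "")

def k_llm_answer(question: str, models: list, quorum: int = 2):
    # Each model answers independently; an answer is accepted only if
    # at least `quorum` models agree, so the models effectively check
    # one another's responses and a lone hallucination is voted out.
    answers = [call_model(m, question) for m in models]
    best, votes = Counter(answers).most_common(1)[0]
    return best if votes >= quorum else None  # abstain without consensus

models = ["model_a", "model_b", "model_c"]
print(k_llm_answer("capital of France?", models))  # -> Paris
```

Raising `quorum` trades availability for reliability: with `quorum=3`, the ensemble above would abstain rather than answer, which is often the preferable failure mode for sensitive domains.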
As the Chief AI Officer, my role encompasses navigating the expansive impact AI has across various domains within our organization. Here are some of the most significant areas of focus for me:
On the regulation front, I spend a considerable amount of time in discussions with our legal team, compliance officers, and information security personnel. The landscape for AI regulation is largely uncharted, which means crafting guidelines and guardrails where precedents are scant. Ideally, I seek clear dos and don’ts, but often, it’s a collaborative effort to define these guidelines. This ongoing conversation focuses on managing risk, protecting our customers, and advancing innovation while minimizing our risk exposure.
We’ve established an Office of Responsible AI, tasked with defining the appropriate business contexts for AI applications. Much of this work involves navigating ethical considerations beyond mere legal compliance, especially since regulations tend to address high-risk areas. However, about 90% of typical company operations fall outside these high-risk categories, placing us in a regulatory gray area. Here, ethical judgment becomes paramount. While I am in favor of the emerging global regulations, I recognize they provide a framework rather than a complete solution. These regulations, focusing primarily on high-risk areas, still require nuanced application in our daily operations.
In essence, my role as CAIO demands a versatile approach that balances technical expertise, ethical foresight, and strategic planning. It’s about harnessing AI’s potential responsibly and effectively navigating both the broad applicability of AI across the business and the evolving landscape of AI ethics and regulations.
As Chief AI Officer, I find myself frequently contemplating the shifting significance of proprietary data ownership within our current AI-driven business paradigm. On one hand, foundation models are democratizing AI, significantly lowering the barrier to entry for companies that lack extensive proprietary datasets. These models offer performance that appears just as robust as if they were trained on specialized, proprietary data. This trend could suggest that the value of owning unique datasets may be diminishing, as powerful AI capabilities become accessible to a wider range of entities without substantial data assets.
However, the landscape is nuanced. We’re witnessing a rise in techniques such as fine-tuning and additional pre-training, which tailor these generalist models to specific needs, subtly reinstating the importance of unique data. This customization capability hints that data ownership might evolve rather than diminish in relevance, serving as a new competitive edge or barrier to entry.
Furthermore, the strategic pivots of major companies such as Meta toward open sourcing their AI solutions are not purely altruistic but are aimed at disrupting the status quo, challenging the dominance of giants such as Microsoft and Google. This move toward open sourcing is reshaping the industry, compelling these giants to augment their offerings with more comprehensive, enterprise-oriented ecosystems around their models. The ultimate value proposition is no longer just the models themselves but the entire package—the ecosystems that support them, making them appealing for enterprise applications.
Amidst this, the role of regulators and differing international stances on data privacy and sharing come into play, potentially steering the market in various directions. This creates a complex environment where businesses must navigate not only technological advancements but also regulatory landscapes that could influence the strategic value of data ownership.
In conclusion, while the democratization of AI through foundation models and open source initiatives challenges traditional notions of data ownership, it simultaneously opens new avenues for competitive differentiation. Businesses must stay agile, reevaluating their data strategies in light of these developments, to leverage AI effectively while navigating the regulatory and strategic nuances of this evolving landscape.
David has a long track record of academic research, which he dovetails with industry engagements and collaborations. In this section, he shares his novel insights on some of the emerging developments in LLMs.
Let’s go through the questions and answers with David Sontag.
In the realm of healthcare, the application of ML extends beyond mere predictive analytics to fostering insights that can fundamentally alter patient care and outcomes. This domain’s complexity is underscored by the challenge of capturing the nuanced social determinants of health—variables such as living conditions, food security, and access to transportation—that significantly influence health outcomes. However, the current landscape of data collection and model training often overlooks these critical, yet less quantifiable aspects of patient life, leading to a gap in the personalized application of ML predictions.
A predominant issue arises from the reliance on surrogates or proxies in datasets that fail to encapsulate the individual’s complexity fully. This reliance can obscure the subtleties inherent to each patient, thereby diluting the potential for ML to effect meaningful change in healthcare settings. The disparity between what the data models are trained on and the real-world contexts they are applied to further complicates this issue. For instance, LLMs trained on generic text data lack the contextual richness necessary for nuanced applications, such as tailoring healthcare recommendations to individual social circumstances.
This disconnect not only hampers the model’s utility in providing relevant insights but also introduces unintended biases. These biases emerge when models, devoid of context or unaware of their training data’s limitations, misapply generalized predictions to individual cases. Addressing this challenge requires a concerted effort toward enriching data collection processes to capture a more comprehensive view of patient social determinants and ensuring models can interpret and apply this information effectively.
To mitigate implicit biases in large datasets and advance toward equitable ML models, a multifaceted approach focusing on data collection, analysis, and model refinement is essential. Key strategies include decomposing discrimination metrics into bias, variance, and noise (“Why Is My Classifier Discriminatory?”) to identify specific sources of unfairness, emphasizing the critical role of contextually rich and adequately sized training samples in improving both fairness and accuracy.
Additionally, augmenting datasets with more representative samples and relevant variables can address disparities in predictive performance across different groups (“The Potential For Bias In Machine Learning And Opportunities For Health Insurers To Address It”). Implementing these strategies necessitates a rigorous, ongoing evaluation of model outputs and impacts, ensuring they do not perpetuate existing biases or introduce new ones. Collaborative industry efforts toward algorithmic vigilance, ethical use of sensitive data, and incorporating diverse perspectives in model development processes are also vital. By prioritizing fairness as a fundamental aspect of model accuracy and utility, we can leverage ML to deliver more just and equitable outcomes across sectors.
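The ongoing evaluation described above starts with a simple audit: measuring a model's error rate separately for each group so that disparities are visible before any decomposition into bias, variance, and noise is attempted. The sketch below uses synthetic data and a hypothetical group label; it is a minimal illustration of the auditing step, not the full decomposition from the cited paper.

```python
# Per-group error-rate audit: the first, simplest signal of disparate
# predictive performance across groups. Data here is synthetic.
from collections import defaultdict

def per_group_error(records):
    """records: iterable of (group, y_true, y_pred). Returns group -> error rate."""
    counts = defaultdict(lambda: [0, 0])  # group -> [errors, total]
    for group, y_true, y_pred in records:
        counts[group][0] += int(y_true != y_pred)
        counts[group][1] += 1
    return {g: errs / total for g, (errs, total) in counts.items()}

data = [
    ("A", 1, 1), ("A", 0, 0), ("A", 1, 0), ("A", 0, 0),  # group A: 1/4 wrong
    ("B", 1, 0), ("B", 0, 1), ("B", 1, 1), ("B", 0, 0),  # group B: 2/4 wrong
]
rates = per_group_error(data)
disparity = max(rates.values()) - min(rates.values())
print(rates)                                 # {'A': 0.25, 'B': 0.5}
print(f"error-rate disparity: {disparity:.2f}")
```

A large disparity is the cue to investigate whether more representative samples or richer variables for the disadvantaged group would close the gap.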
In summary, before delving into strategies for creating equitable and unbiased datasets as outlined previously, it’s crucial to acknowledge the foundational challenges faced by ML in healthcare. These challenges include the need for a deeper understanding of patient social determinants and the imperative to bridge the gap between what the data models are trained on and the contexts in which they are deployed. Addressing these issues is a prerequisite for leveraging ML to its fullest potential in improving healthcare outcomes and ensuring that innovations in ML contribute positively and equitably to patient care.
As NLP technologies continue to evolve, strategies to enhance their utility and fairness are also advancing, particularly in the work led by David Sontag’s team at MIT. David shared these three research advancements that they are leading in the lab:
These advancements underscore a broader commitment to improving the flexibility, transparency, and applicability of NLP technologies. By focusing on these key areas, David Sontag’s research at MIT aims to propel the field forward, ensuring NLP tools are not only more powerful but also more accessible, understandable, and ethical for users across various sectors. This approach aligns with the highest standards of academic and practical excellence, promising to shape the next generation of NLP applications in healthcare and beyond.
In the evolving regulatory landscape surrounding AI, significant implications are emerging for the future development of LLMs. As regulations continue to advance, focusing on AI safety, including concerns around national security threats and the ethical use of AI, the framework within which LLMs are developed and deployed is being reshaped:
These developments, forecasted by David Sontag’s insights, underscore a future where LLMs are not only technologically advanced but also ethically grounded and compliant with regulations. This trajectory ensures that as LLMs become more embedded in various sectors, they do so in a manner that prioritizes safety, fairness, and transparency. Such an approach not only aligns with the highest standards of academic excellence but also positions LLMs to make a positive and responsible impact on society.
John brings the executive aspect to this chapter. In this section dedicated to his perspectives, he lays out a broad spectrum of insights and actions that companies and organizations can roll out to enable AI advancements in a closely monitored and responsible manner.
Let’s go through the questions and answers with John D. Halamka.
In reconciling the need for open, reproducible research with the protection of individual privacy within the NLP community, the “Data Behind Glass” model pioneered by the Mayo Clinic Platform offers a compelling solution. This model represents a paradigm shift in the handling of sensitive health data, embodying a platform-centric approach that ensures data quality, regulatory compliance, and, above all, the maintenance of patient trust throughout the data’s life cycle.
At its core, Mayo Clinic Platform Connect serves as a distributed data network that exemplifies a federated architecture. Within this network, partners contribute their unique datasets while retaining strict control over their data, safeguarding privacy and confidentiality within their organizational IT boundaries. This federated approach enables a collaborative yet secure environment for data sharing and utilization.
Key to the success of this model is the meticulous process of data de-identification. By employing industry-accepted statistical methods aligned with privacy laws and regulations, data is rendered anonymous, ensuring that individual privacy is preserved while retaining the data’s value for research and development. Techniques such as hashing, uniform date-shifting, and tokenization are utilized to obfuscate data, facilitating its use in federated learning without compromising patient privacy.
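The three techniques named above can be sketched on a toy record. The field names, the salt, and the per-patient date offset are illustrative assumptions; production de-identification uses vetted, certified pipelines rather than ad hoc code like this.

```python
# Minimal sketch of hashing, uniform date-shifting, and tokenization
# applied to a toy patient record.
import hashlib
from datetime import date, timedelta

def hash_identifier(value: str, salt: str = "site-secret") -> str:
    """One-way salted hash: identifiers stay linkable but not reversible."""
    return hashlib.sha256((salt + value).encode()).hexdigest()[:16]

def shift_date(d: date, patient_shift_days: int) -> date:
    """Uniform date-shifting: every date for one patient moves by the same
    offset, preserving intervals between events while hiding real dates."""
    return d + timedelta(days=patient_shift_days)

def tokenize(text: str, vocabulary: dict) -> str:
    """Replace sensitive surface strings with opaque tokens."""
    for term, token in vocabulary.items():
        text = text.replace(term, token)
    return text

record = {"mrn": "12345", "admit": date(2023, 5, 1), "note": "Seen at Mayo Clinic."}
shift = -37  # one fixed offset chosen per patient
deidentified = {
    "mrn": hash_identifier(record["mrn"]),
    "admit": shift_date(record["admit"], shift),
    "note": tokenize(record["note"], {"Mayo Clinic": "<FACILITY>"}),
}
print(deidentified)
```

Because the date offset is constant per patient, the interval between, say, admission and discharge survives de-identification, which is what keeps the data useful for modeling.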
Moreover, the secure-by-design philosophy underpinning Connect ensures that data and intellectual property (IP) remain under the control of their respective owners, accessible only as authorized. This approach not only protects privacy but also fosters innovation by allowing Mayo Clinic Platform customers to develop, train, and validate algorithms on de-identified data cohorts. Rigorous controls, including code repository reviews, strict access management, and prohibitions on data imports and exports, further reinforce the platform’s commitment to privacy and security.
The “Data Behind Glass” model is uniquely positioned to address the evolving regulatory landscape. With international regulators intensifying scrutiny over AI and ML applications, Mayo Clinic Platform’s adaptable framework is designed to navigate the complex patchwork of global privacy regulations. Whether it’s the General Data Protection Regulation (GDPR) in the European Union, the General Data Protection Law (LGPD) in Brazil, or China’s security and privacy rules, the model ensures compliance while enabling global collaboration.
In summary, the “Data Behind Glass” model presents a viable pathway for the NLP community to achieve the dual objectives of fostering open research and safeguarding privacy. By de-identifying, securing, and federating data, Mayo Clinic Platform democratizes its use without compromising patient privacy, setting a precedent for responsible data handling in an era where the balance between transparency and privacy is paramount. This model exemplifies how technical innovation, coupled with a deep commitment to ethical standards, can pave the way for transformative advances in healthcare and beyond, ensuring that patient trust remains at the forefront of digital health initiatives.
Let’s start by reviewing a strong source of guidance that seeks to promote policymaking in the healthcare space around the use of LLMs and AI: the Coalition for Health AI (CHAI™).
On its website, CHAI talks about the following initiative:
“The Coalition for Health AI (CHAI™) (https://coalitionforhealthai.org/) is working to develop guidelines to drive high-quality healthcare through the adoption of credible, fair, and transparent health AI systems. We offer a draft blueprint for trustworthy AI implementation guidance and assurance for healthcare V1.0 (https://coalitionforhealthai.org/insights) for public review and comments.”
CHAI contributes to the healthcare sector by developing guidelines for the adoption of credible, fair, and transparent health AI systems. Their draft blueprint for trustworthy AI implementation and assurance highlights the importance of aligning with the National Institute of Standards and Technology’s (NIST’s, under the U.S. Department of Commerce) AI risk management framework and extends these concepts to healthcare. Key contributions include the following:
CHAI’s efforts aim to ensure AI systems in healthcare are developed and deployed in a manner that upholds ethical standards, enhances patient care, and maintains public trust.
“AI indeed reshapes companies. In particular, at Mayo Clinic we asked ourselves the question, should we centralize AI operations or distribute them within the organization? I have observed many cases where different approaches were applied. At Mayo, our approach has been to decentralize all AI work but centralize data governance and policymaking. That enables innovation without regret.”
Let’s review some of the key benefits of this work model.
By adopting a model that decentralizes AI work while centralizing data governance and policymaking, organizations such as Mayo Clinic can stimulate innovation and adaptability in AI applications while ensuring data security, quality, and regulatory compliance. This balanced approach enables “innovation without regret,” allowing for the exploration and implementation of AI solutions in a responsible and effective manner.
The Centers for Medicare & Medicaid Services (CMS) notice of proposed rulemaking is quite helpful in providing guidelines around the role of AI. John explains that “The proposal says that all AI should augment, not replace, human decision-making.”
We dove into the proposal, presented online (https://www.govinfo.gov/content/pkg/FR-2022-08-04/pdf/2022-16217.pdf). In particular, we focused on the Use of Clinical Algorithms in Decision-Making (§ 92.210) section on page 47880, and derived the following takeaways:
In summary, CMS’s approach emphasizes the critical balance between leveraging AI for healthcare improvement and ensuring that these tools do not undermine human judgment or perpetuate discrimination. Their proposed rule and call for comments reflect an ongoing effort to develop responsive and responsible guidelines for AI’s role in healthcare decision-making.
Let’s go through the questions and answers with Xavier Amatriain.
The most important thing to keep in mind is that we are very early in the LLM research space and this is a rapidly evolving field. While attention-based transformers have taken us very far, there is room for many other approaches. For example, on the pre-training side, there is now a lot of interesting research into post-attention approaches such as structured state space models (SSMs, or S4). Similarly, mixtures of experts (MoEs), while not new, are recently proving their incredible power to deliver smaller models that are very efficient, such as Mixtral by Mistral AI. And this is only in the pre-training space. For alignment, we have seen approaches such as Direct Preference Optimization (DPO) or Kahneman-Tversky Optimization (KTO) show a lot of promise very quickly. Not to mention the use of self-play as a mechanism for improvement and alignment.
My main message here is that we should hold tight and expect a lot of innovation to come our way very fast in the next few years. I think in a couple of years we will look back and think of the GPT-4 architecture as something old and completely inefficient. Very importantly, some of these improvements will make LLMs better in accuracy, but also much more efficient in cost and size, so we should expect to have GPT-4-like models running on our phones.
There are many ways and places where ensemble techniques can and will be used in the context of LLMs. The criteria to select and combine them depend on the uses and where this combination happens. Here are three places where combining LLMs is useful:
In the pre-training phase, mixtures of experts (MoEs) are a form of ensemble where different deep neural networks are combined to improve the output. The weights to select and weigh the different experts are learned during pre-training. Importantly, some of those weights are zero, making inference much more efficient since not all experts are needed for all tasks.
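The zero-weight property mentioned above is what makes sparse MoE inference cheap: a router scores each expert, keeps only the top-k, and zeroes the rest, so only a subset of experts runs per input. The sketch below hard-codes the router scores and uses trivial functions as "experts"; in a real MoE layer both the router and the experts are learned networks.

```python
# Toy sketch of sparse MoE gating: top-k selection, zeroing, renormalization.

def sparse_gate(scores, k=2):
    """Keep the k largest scores, zero the rest, renormalize to sum to 1."""
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    kept = [scores[i] if i in top else 0.0 for i in range(len(scores))]
    total = sum(kept)
    return [w / total for w in kept]

def moe_forward(x, experts, scores, k=2):
    """Run and combine only the experts with nonzero gate weight."""
    weights = sparse_gate(scores, k)
    return sum(w * f(x) for f, w in zip(experts, weights) if w > 0.0)

experts = [lambda x: x + 1, lambda x: 2 * x, lambda x: x ** 2, lambda x: -x]
scores = [0.1, 3.0, 2.0, 0.5]            # router output for this input
print(sparse_gate(scores))               # [0.0, 0.6, 0.4, 0.0]
print(moe_forward(4, experts, scores))   # only experts 1 and 2 execute
```

The compute saving comes from the `if w > 0.0` guard: with four experts and k=2, half of the expert forward passes are skipped entirely.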
Another way to combine different LLMs is during the distillation phase. In some approaches such as teacher/student distillation, LLMs are used to generate data to then train a smaller or more specific model. The selection and weight of each LLM is learned during the training phase of the student model.
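A minimal sketch of that pipeline: two "teacher" LLMs (stubbed here as simple labeling functions) generate training pairs with fixed blending weights, and a small student model is fit on the synthetic data. The teachers, the weights, and the linear student are deliberately trivial stand-ins for illustration only; real distillation learns the teacher weighting during student training rather than fixing it up front.

```python
# Teacher/student distillation in miniature: teachers label, student fits.

def teacher_a(x: float) -> float:   # stand-in for a large LLM's output
    return 2.0 * x + 1.0

def teacher_b(x: float) -> float:   # a second teacher, weighted lower
    return 2.0 * x + 3.0

def generate_dataset(xs, teachers_with_weights):
    """Blend teacher outputs into labels for the student."""
    return [(x, sum(w * t(x) for t, w in teachers_with_weights)) for x in xs]

def fit_student(pairs):
    """Fit y = a*x + b by least squares: the 'smaller, more specific model'."""
    n = len(pairs)
    sx = sum(x for x, _ in pairs); sy = sum(y for _, y in pairs)
    sxx = sum(x * x for x, _ in pairs); sxy = sum(x * y for x, y in pairs)
    a = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    b = (sy - a * sx) / n
    return a, b

data = generate_dataset([0.0, 1.0, 2.0, 3.0], [(teacher_a, 0.75), (teacher_b, 0.25)])
a, b = fit_student(data)
print(a, b)   # student recovers the blended teachers: a=2.0, b=1.5
```

The student never sees the teachers themselves, only their blended labels, which is why it can be far smaller than any teacher.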
Finally, we can combine LLMs at the application layer by treating each LLM instance as an agent. This leads to the notion of multi-agent systems where LLM-powered agents that are specialized for a task are combined to do a more complex one.
Generative AI is going to revolutionize every aspect of organizations. My strong prediction is that AI is going to become another member of the organization. For example, software engineers will collaborate with an AI (or several of them) in their day-to-day work. This will make them not 10x but 100x more efficient.
Of course, such a revolutionary force will change how we organize teams, hire people, or evaluate their performance. I think it is very important that we prepare for a world, coming very soon, where a very important skill for anyone in an organization will be their ability to collaborate and work with AI.
Melanie brings her vast experience working in the legal and regulatory space. As AI and LLMs continue to drive policies and guidelines, the value of such subject matter expertise is becoming clearer and more significant.
Let’s go through the questions and answers with Melanie Garson.
Understanding the geopolitical landscape surrounding AI, including regulatory, legal, and risk considerations, is of paramount importance for technical practitioners, from developers to subject-matter experts (SMEs). In the realm of AI, as companies navigate strategic and policy discussions, the inclusion of technically savvy individuals in these conversations is indispensable. Decision-makers increasingly recognize the value of having technical perspectives at the table to ensure that decisions are well rounded and informed by the technological possibilities and limitations.
An informed technical professional can effectively communicate their insights, bridging the gap between technical potential and executive vision. This capacity not only enhances the decision-making process but also ensures that strategies are robust, compliant, and cognizant of the evolving regulatory landscape.
Moreover, as organizations endeavor to align their operations with regulatory requirements and mitigate potential risks, they are likely to establish specialized teams tasked with developing and implementing technological solutions that adhere to these new strategic directions. Technical experts who are well-versed in the legal and regulatory dynamics shaping the AI industry will find themselves at a significant advantage, poised to contribute meaningfully to these teams. Their expertise not only makes them invaluable members but also primes them for leadership roles within these strategic initiatives, driving compliance, innovation, and competitive edge in a tightly regulated global market.
From a legal standpoint, the rapid advancements in AI technology present a spectrum of risks that can be classified into several distinct categories, each with its unique set of challenges and implications. These risks encompass the following:
Recognizing the breadth and depth of these risks is crucial for countries, developers, and society at large to ensure that the deployment of AI technologies proceeds in a manner that minimizes potential harm. This necessitates a proactive approach to governance, development practices, and societal engagement to navigate the complex landscape of AI advancements responsibly.
To mitigate ethical concerns such as bias and ensure the responsible use of AI and LLMs in decision-making processes, especially in high-risk and regulated industries, a multifaceted approach is required. This approach should address both technical and socio-technical challenges posed by the integration of AI systems into critical areas of business and society. The following strategies can guide the development and deployment of AI systems:
By adopting these strategies, AI developers and policymakers can address the challenges of bias and ensure that AI and LLMs are used responsibly and effectively, especially in sectors where their impact is most profound and potentially transformative.
To transition from traditional roles to collaborative human-AI teams and ensure the development of human expertise alongside AI integration in the workplace, a multifaceted approach is essential. This strategy encompasses the following:
By addressing these key issues, organizations can cultivate an environment where AI-enabled tools are integrated thoughtfully into the workplace. This ensures that human expertise is not only preserved but also enhanced, paving the way for a future where collaborative human-AI teams drive innovation, productivity, and sustainable growth in an ethically responsible manner.
In this concluding chapter of our exploration into the dynamic world of NLP and LLMs, we have had the privilege of engaging with experts across various fields. Their insightful discussions have illuminated intricate developments, legal considerations, operational approaches, regulatory influences, and emerging capabilities of LLMs. Through their expert lenses, we delved into pressing issues such as creating equitable datasets, advancing NLP technologies, navigating privacy protections in research, restructuring organizations around AI, and anticipating breakthroughs in learning paradigms.
The dialogue with these luminaries has underscored a common theme: the intersection of technological innovation with ethical, legal, and organizational considerations. As we ponder strategies to mitigate biases in datasets, envision the future of hybrid learning paradigms, and assess the impact of foundation models on data ownership, it becomes clear that the evolution of NLP and LLMs is not merely a technological journey but a multidisciplinary venture that challenges us to think deeply about the broader implications of these advancements.
This chapter, serving as the capstone of our book, ties together the expansive topics discussed throughout the chapters, from the basics of NLP and its integration with ML to the intricate designs of LLMs, their applications, and the trends they herald for the future. It encapsulates the essence of our journey—highlighting how the collaboration between academia and industry, underpinned by a thorough understanding of the ethical and legal landscapes, is crucial for harnessing the full potential of LLMs.
As we conclude not just this chapter but the book itself, we stand on the precipice of a new era in NLP and LLMs. The insights shared by our experts do not mark an end but a beacon for future exploration and innovation in the field. This book has aimed to furnish readers, whether they come from academia or industry, with a comprehensive understanding and foresight into the evolution of NLP and LLMs, encouraging them to contribute to this ever-evolving narrative with their own research, developments, and ethical considerations.
As this ebook edition doesn't have fixed pagination, the page numbers below are hyperlinked for reference only, based on the printed edition of this book.
A
activation function 145
exponential linear unit (ELU) 146
hyperbolic tangent (tanh) function 146
layer 147
Leaky ReLU 146
rectified linear unit (ReLU) function 146
sigmoid function 145
softmax function 146
AdaBoost 74
advanced LangChain configurations and pipelines, applying
paid LLM (OpenAI's GPT) and free LLM (from Hugging Face), selecting between 222, 223
QA chain, creating 223
required Python libraries, installing 222
advanced methods, using with chains
element of memory, inserting 225-227
LLM, used for answering questions 224
output structure, requesting 224
AllenNLP 91
Amatriain, Xavier
insights, on NLP and LLMs 281, 294, 295
Amazon Machine Images (AMIs) 215
Amazon SageMaker 215
anomaly detection 81
artificial intelligence (AI) 2, 141
autoencoder (AE) 151
AutoGen 236
key capabilities 237
autoregressive language modeling 155
AWS
LLMs 215
LLMs, deploying and productionizing on 215
LLMs, experimenting on 215
Azure Cognitive Services 216
Azure Container Instances (ACI) 216
Azure Kubernetes Service (AKS) 216
Azure Machine Learning (Azure ML) 216
Azure OpenAI Service 216
B
backpropagation 148
bagging 60
bag-of-words (BOW) 110
batch size 147
selecting 148
BERT Base 161
BERT Large 161
Beth Israel Deaconess Medical Center (BIDMC) 280
Bidirectional Encoder Representations from Transformers (BERT) 7, 93, 139, 160, 172
design 161
fine-tuning 162
fine-tuning, for text classification 163, 164
pretraining 162
black boxes 152
boosting 74
pros and cons 75
bootstrap aggregating (bagging) 73
pros and cons 74
byte pair encoding (BPE) 161
C
centralized data governance
benefits 292
chains
advanced methods, using with 224
ChatGPT
example 9
fine-tuning process 183
pretraining process 183
response, generating 185
system-level controls 185
chief AI officer (CAIO) 269
core responsibilities and traits 270
chi-squared 41
clustering 106
code settings, ML system design for NLP classification
feature engineering 133
numerical features, exploring 133
preliminary statistical analysis 133, 134
train/test sets, splitting 133
complementary event 22
complex analysis
human intervention, within team’s tasks 240
visualization, creating 238
computation power 246
digital interactions and insights, reshaping 247
future 247
purpose 246
value 246
computation power advancements
cloud computing 250
democratization of high-end computation 250
economies of scale and cost-efficiency 248
energy efficiency and sustainability 249
exponential increase in speed 247
specialized hardware for NLP 249
conditional random fields (CRFs) 89
confusion matrix 118
context compression
continuous bag-of-words (CBOW) 103, 115
continuous relevance
ensuring, through incremental updates and automated monitoring 212
convolutional neural network (CNN) 149
correlated data
cost-sensitive learning 80, 81
co-training 108
cumulative distribution function (CDF) 24
D
data
data availability, future prediction
diversity 252
domain expertise and specialization 252
implicit biases 252
regulatory landscapes 253
data transformation 38
errors, correcting 40
missing values, handling 36
outliers, handling 39
standardizing 38
data exploration 34
domain knowledge 35
feature engineering 35, 50, 51
feature selection 40
statistical analysis 35
data normalization 4
decentralized AI
work model benefits 291
deep learning-based methods 93, 94
deep learning (DL) 139, 141, 166, 167
advantages 141
basics 141
deep neural network 144
diagonal matrix 18
dimensionality reduction techniques 46
PCA 46
discrete random variables
distribution 23
domain knowledge 35
E
finding, with numerical methods 19
Elo rating system 187
ensemble models 73
bagging 73
random forests 76
stacking 75
ensemble techniques 81
epoch 147
Euclidean norm 16
example designs, state-of-the-art LLMs
GPT-3.5 and ChatGPT 183
GPT-4 190
LLaMA 191
LM Pretraining 186
open-source tools for RLHF 193, 194
reinforcement learning, used to fine-tune 188-190
reward model, training 187, 188
exponential linear unit (ELU) 146
F
Facebook AI Similarity Search (FAISS) 213
feature engineering 35, 50, 51
logarithmic transformation 53
polynomial expansion 52
feature scaling, methods
log transformation 51
min-max scaling 50
power transformation 51
robust scaling 51
standardization 51
feature selection
dimensionality reduction techniques 46
embedded methods 44
filter methods 41
LASSO 44
LASSO or ridge regression, selecting 46
ridge regression 45
feature selection, ML system design for NLP classification
ML modeling 136
feedforward neural network (FNN) 148
filter methods 41
chi-squared 41
correlation coefficients 42, 43
mutual information 42
Flair 91
G
Garson, Melanie
insights, on NLP and LLMs 281, 295-298
gated recurrent units (GRUs) 142
GCP
LLMs, deploying and productionizing with 218
LLMs, experimenting with 217
General Architecture for Text Engineering (GATE) 91
General Data Protection Law (LGPD) 289
General Data Protection Regulation (GDPR) 289
General Language Understanding Evaluation (GLUE) 175
generative adversarial network (GAN) 151
generative AI (GenAI) 298
generative pretrained transformer 3 (GPT-3) 164
architecture 164
business objective 165
data, formatting 167
deep learning language model, employing 166, 167
design 164
evaluation metric 167
few-shot learning 165
one-shot setting 165
pipeline 166
technical objective 165
trainer object 167
use case, reviewing 165
using, challenges 165
zero-shot setting 165
generative pre-trained transformers (GPTs) 7, 139, 172
GPT-4 190
GPT model 183
graphics processing units (GPUs) 246
grid search 72
H
Halamka, John D
insights, on NLP and LLMs 280-293
Health Insurance Portability and Accountability Act (HIPAA) 291
hidden layers 147
hidden Markov model (HMM) 92, 153, 173
hierarchical Dirichlet process (HDP) 105
Householder matrix (H) 30
Hugging Face
hub of models 205
model, choosing 205
Hugging Face’s LLMs
employing, via Python 205, 206
human intervention, within team’s tasks 240
agents, defining 242
experiment results, reviewing 241
group conversation, defining 242
task to be fulfilled by team, defining 241, 242
team, deploying 242
team member roles, assigning 242
team’s judgement, evaluating 243
hyperbolic tangent (tanh) function 145, 146
I
imbalanced data, handling 77, 78
cost-sensitive learning 80, 81
resampling 78
SMOTE 78
implicit language Q-learning (ILQL) 194
information
retrieving, from web sources 228
input
validating 95
Inside-Outside-Beginning (IOB) 90
intellectual property (IP) 289
interquartile range (IQR) 51
iterated online RLHF 190
J
Jupyter notebook
LangChain setup, reviewing 212
L
label propagation 108
LangChain
chains 209
data, not pre-embedded 209
design concepts 207
long-term memory 211
pipeline, with Python 221
referring, to prior conversations 211
LangChain pipeline, with Python
advanced LangChain configurations and pipelines, applying 222
paid LLMs, versus free LLMs 222
required Python libraries, installing 213, 214
setting up 213
LangChain setup
reviewing, in Jupyter notebook 212
language model 153
semi-supervised learning 154
training, with self-supervised learning 154, 155
transfer learning 155
unsupervised learning 154
language models, architectures
BERT 160
GPT-3 164
large datasets
value 251
large language models (LLMs) 159, 220, 245
change management, by AI impact 267
cloud services, concluding 218
complex analysis, completing 237
cultural trends 263
customer interactions and service 267
deploying, and productionizing on Azure 217
deploying, and productionizing on GCP 218
deploying, on AWS 215
experimenting, on AWS 215
developing, challenges 179-181
developing, motivations 174-178
experimenting, on Azure 216
experimenting, on GCP 217
evolution 254
features 174
hidden Markov models (HMMs) 173
impact 255
in AWS 215
in cloud 214
in GCP 217
in Microsoft Azure 216
internal business structure and operations, shifts 267-271
n-gram models 173
productionizing, on AWS 215
recurrent neural networks (RNNs) 173
refinement, in learning schemes and deep learning architectures 256
team, forming 234
types 182
used, for programming as code generators 261
value 255
versus language models 172
working, potential advantages 234-236
latent Dirichlet allocation (LDA) 47, 48, 103, 107
Leaky ReLU 146
least absolute shrinkage and selection operator (LASSO) 44, 135
lemma 88
libraries, for NLP tasks
AllenNLP 91
Flair 91
GATE 91
NLTK 91
spaCy 91
Stanford Named Entity Recognizer (NER) 91
advantages 55
disadvantages 55
LLaMA 191
LLM application
open source and closed source, distinguishing aspects 204
remote LLM provider, selecting 200
LLMLingua 231
LLM operations (LLMOps)
used, for operations and maintenance 262
LLM performance
enhancing, with LangChain 221
enhancing, with RAG 221
LLMs types
transformer models 182
logarithmic transformation 53
advantages 56
disadvantages 57
log-likelihood 26
gated recurrent unit (GRU) networks 173
long short-term memory (LSTMs) 142
Low-Rank Adaptation (LoRA) 189
M
machine learning (ML) 103, 245
NLP, integrating with 6
machine learning (ML), probability
discrete random variables 23
probability density function (PDF) 24
statistically independent 22, 23
machine learning models
logistic regression 56
machine processing, natural language
masked language modeling (MLM) 155
matrices 14
basic operations 16
matrix definitions
rectangular diagonal matrix 17
symmetric matrix 17
matrix transpose 16
maximum likelihood estimate (MLE) 26
Mekel-Bobrov, Nitzan
insights, on NLP and LLMs 280-284
methods, for handling missing value
dropping rows 36
K-nearest neighbor imputation 37
mean/median/mode imputation 37
multiple imputation 37
regression imputation 37
methods, to prevent overfitting
cross-validation 68
data augmentation 69
dropout 69
early stopping 68
ensemble methods 69
regularization 68
Microsoft Azure
LLMs 216
LLMs, deploying and productionizing with 216
LLMs, experimenting with 216
Mixtures of Experts (MoEs) 294
ML modeling 136
ML system design for NLP classification
feature selection 135
model generation 136
model application 111
accuracy 117
F1 macro 118
F1 micro 118
F1 score 117
precision 117
recall 117
model generation, ML system design for NLP classification
design 136
performance 136
model training 111
motivations, for developing LLMs 174
complex contexts 177
few-shot learning 177
human-like text generation 178
improved performance 174
multilingual capabilities 177
multilayer perceptron (MLP) 149
multilingual capabilities 177
cross-lingual language model (XLM) 177
distilBERT Multilingual 178
MarianMT 178
mBERT (Multilingual BERT) 177
T2T (T5) Multilingual 178
XLM-RoBERTa 178
multiple-agent team 243
multiple retrieval sources 209
mutually exclusive 23
N
Naive Bayes 106
named entity recognition (NER) 89, 90
applications 90
categories 89
data collection 90
deployment 90
evaluation 90
labeling 90
preprocessing 90
preprocessing, code examples 100, 101
training 90
working 89
natural language processing (NLP) 2, 85, 103, 139, 150, 245
business world 263
cultural trends 263
data availability, future 252
history and evolution 2
integration, with ML 6
language models, ChatGPT example 9
lowercasing 86
natural language understanding (NLU) 175
epoch 147
hidden layer 144
input layer 143
output layer 144
neural network, architecture 148
autoencoder (AE) 151
convolutional neural network (CNN) 149
feedforward neural network (FNN) 148
generative adversarial network (GAN) 151
multilayer perceptron (MLP) 149
recurrent neural network (RNN) 150
neural network parameters
fine-tuning 167
neuron/node 144
activation function 145
bias 145
weighted sum 145
weights 145
neurons 62
n-gram models 173
NLTK 91
O
one-hot encoding vector
using, in text classification 108
OpenAI’s GPT model
outliers handling
methods 39
out-of-vocabulary (OOV) 99
P
Pearson 42
Penn Treebank tagset 94
pipeline
polynomial expansion 52
positional encoding 158
positive predictive value (PPV) 117
deep learning-based methods 93, 94
preprocessing, code examples 100, 101
regular expression 94
statistical methods 92
principal component analysis (PCA) 20, 36
prior 167
probability density function (PDF) 24, 25
maximum likelihood estimation 26-29
probability mass function (PMF) 23
prompt compression 231
code settings 232
data, gathering 233
experiments 233
LLM configurations 233
trade-offs, evaluating 231, 232
prompt engineering 201, 202, 256, 257
proximal policy optimization (PPO) 184
punctuation
removing 86
R
advantages 59
disadvantages 60
random search 72
rectangular diagonal matrix 17
rectified linear unit (ReLU) function 145, 146
recurrent neural networks (RNNs) 6, 89, 139, 150, 173
long short-term memory (LSTM) networks 173
recursive feature elimination (RFE) 43
regular expression 94
input, validating 95
text cleaning 96
using, steps for data extraction 95
regular expression tokenization 98
Reinforcement Learning for Language Models (RL4LMs) 194
reinforcement learning from human feedback (RLHF) 183-185, 254
remote LLM provider, selecting
OpenAI’s remote GPT access, in Python via API 200
resampling 78
retrieval-augmented generation (RAG) 206, 220, 257, 258
advantages 259
applications 260
challenges 259
data integration 258
data transformation 259
user interaction 259
ridge regression 45
S
scalars 14
scaled dot-product attention 158
self-attention mechanism 158, 182
self-supervised learning
using, to train language models 154, 155
semi-supervised learning 105, 107, 154
co-training 108
label propagation 108
sentence tokenization 98
sentiment analysis 87
sequential model-based optimization (SMBO) 72
sigmoid function 145
significance of experiments visualization
agents, defining 238
creating 238
group conversation, defining 239
task to be fulfilled, defining 238
team members roles, assigning 238
simple linear equation 54
singular value decomposition (SVD) 20, 21
skip-gram 115
softmax function 146
Sontag, David
insights, on NLP and LLMs 280, 285-288
spaCy 91
special characters
removing 86
spell checking and correction 87
squared norm 16
stacking 75
standardization 38
Stanford Named Entity Recognizer (NER) 91
state-of-the-art LLMs
example designs 183
statistical analysis 35
statistical language modeling 6
statistical methods 92
advantages 93
disadvantages 93
stem 88
stochastic gradient descent (SGD) 115
stop word 87
stop word removal 87
Structured State Space Models (SSMs) 294
subject-matter experts (SMEs) 295
sum of squared errors (SSE) 46
logistic regression 106
Naive Bayes 106
support vector machines (SVMs) 106
support vector machines (SVMs) 60, 61, 104, 106
advantages 61
disadvantages 62
symmetric matrix 17
Synthetic Minority Oversampling Technique (SMOTE) 78, 79
pros and cons 79
system prompt 201
T
t-distributed stochastic neighbor embedding (t-SNE) 36
tensor processing units (TPUs) 249
term frequency-inverse document frequency (TF-IDF) 103, 112
using, in text classification 112-114
text classification
semi-supervised learning 107
unsupervised learning 106, 107
with Word2vec 116
text classification, with one-hot encoding vector 108
model training 111
N-grams 111
text preprocessing 109
vocabulary construction 109, 110
text classification, with Word2vec 114
model evaluation 117
text cleaning 96
using, regular expressions 96
text manipulation 95
Text-to-Text Transfer Transformer (T5) 178
time series cross-validation 71
tokenization
regular expression tokenization 98
sentence tokenization 98
word tokenization 98
tokens 98
topic modeling 87
trainer object 167
neural network parameters, fine-tuning 167
testing results, generating 168
training configurations 167
training results, generating 168
transfer learning 155
examples 156
feature extraction 156
fine-tuning 156
transformer models 182
long-range dependencies 182
scalability 182
speed 182
transformers 63, 64, 157, 166
applications 158
transformers, architecture 157, 158
positional encoding 158
self-attention mechanism 158
Transformers Reinforcement Learning (TRL) 194
triangular matrix 18
true positive rate (TPR) 117
U
undersampling 78
unigram language model (ULM) 161
union and intersection 23
unit 144
unsupervised learning 105-107, 154
upper (or lower) triangular matrix 17
user prompt 201
V
vanishing gradients problem 145
vectors 14
basic operations 17
vocabulary construction 109, 110
W
winsorizing 39
Word2vec
using, in text classification 114, 116
WordPiece tokenization 98
word prediction 27
word tokenization 98
wrapper methods 43
Y
YouTube video
content, retrieving from 228
installs and imports 228
retrieval mechanism, setting up 229
reviewing, and summarizing 229, 230
Z
Z-score normalization 38
Subscribe to our online digital library for full access to over 7,000 books and videos, as well as industry leading tools to help you plan your personal development and advance your career. For more information, please visit our website.
Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.packtpub.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at customercare@packtpub.com for more details.
At www.packtpub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters, and receive exclusive discounts and offers on Packt books and eBooks.
If you enjoyed this book, you may be interested in these other books by Packt:
OpenAI API Cookbook
Henry Habib
ISBN: 978-1-80512-135-0
Generating Creative Images With DALL-E 3
Holly Picano
ISBN: 978-1-83508-771-8
If you’re interested in becoming an author for Packt, please visit authors.packtpub.com and apply today. We have worked with thousands of developers and tech professionals, just like you, to help them share their insight with the global tech community. You can make a general application, apply for a specific hot topic that we are recruiting an author for, or submit your own idea.
Now you’ve finished Mastering NLP from Foundations to LLMs, we’d love to hear your thoughts! If you purchased the book from Amazon, please click here to go straight to the Amazon review page for this book and share your feedback, or leave a review on the site that you purchased it from.
Your review is important to us and the tech community and will help us make sure we’re delivering excellent quality content.
Thanks for purchasing this book!
Do you like to read on the go but are unable to carry your print books everywhere?
Is your eBook purchase not compatible with the device of your choice?
Don’t worry, now with every Packt book you get a DRM-free PDF version of that book at no cost.
Read anywhere, any place, on any device. Search, copy, and paste code from your favorite technical books directly into your application.
The perks don’t stop there, you can get exclusive access to discounts, newsletters, and great free content in your inbox daily.
Follow these simple steps to get the benefits:
https://packt.link/free-ebook/978-1-80461-918-6